https://codereview.stackexchange.com/questions/198343/crawl-and-gather-all-the-urls-recursively-in-a-domain
http://lucene.472066.n3.nabble.com/Using-nutch-just-for-the-crawler-fetcher-td611918.html

https://www.quora.com/What-are-some-Web-crawler-tips-to-avoid-crawler-traps

https://cwiki.apache.org/confluence/display/nutch/
https://cwiki.apache.org/confluence/display/NUTCH/Nutch2Crawling
https://cwiki.apache.org/confluence/display/nutch/ReaddbOptions

https://moz.com/top500
-----------
NUTCH
-----------
https://stackoverflow.com/questions/35449673/nutch-and-solr-indexing-blacklist-domain
  https://nutch.apache.org/apidocs/apidocs-1.6/org/apache/nutch/urlfilter/domainblacklist/DomainBlacklistURLFilter.html

https://lucene.472066.n3.nabble.com/blacklist-for-crawling-td618343.html
https://lucene.472066.n3.nabble.com/Content-of-size-X-was-truncated-to-Y-td4003517.html


Google: nutch mirror web site
https://stackoverflow.com/questions/33354460/nutch-clone-website
[https://stackoverflow.com/questions/35714897/nutch-not-crawling-entire-website
fetch -all seems to be a nutch v2 thing?]

Google (30 Sep): site mirroring with nutch
https://grokbase.com/t/nutch/user/125sfbg0pt/using-nutch-for-web-site-mirroring
https://lucene.472066.n3.nabble.com/Using-nutch-just-for-the-crawler-fetcher-td611918.html
http://www.cs.ucy.ac.cy/courses/EPL660/lectures/lab6.pdf
  slide p.5 onwards

crawler software options: https://repositorio.iscte-iul.pt/bitstream/10071/2871/1/Building%20a%20Scalable%20Index%20and%20Web%20Search%20Engine%20for%20Music%20on.pdf
See also p.20 on HTTrack.


Google: nutch performance tuning
* https://stackoverflow.com/questions/24383212/apache-nutch-performance-tuning-for-whole-web-crawling
* https://stackoverflow.com/questions/4871972/how-to-speed-up-crawling-in-nutch
* https://cwiki.apache.org/confluence/display/nutch/OptimizingCrawls

NUTCH INSTALLATION:
* Nutch v1: https://cwiki.apache.org/confluence/display/nutch/NutchTutorial#NutchTutorial-SetupSolrforsearch

Nutch v2 installation and set up:
* https://cwiki.apache.org/confluence/display/NUTCH/Nutch2Tutorial
* https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781783286850/1/ch01lvl1sec09/installing-and-configuring-apache-nutch


Nutch doesn't work with Spark (yet):
https://stackoverflow.com/questions/29950299/distributed-web-crawling-using-apache-spark-is-it-possible

SOLR:
* Query syntax: http://www.solrtutorial.com/solr-query-syntax.html
* Deleting a core: https://factorpad.com/tech/solr/reference/solr-delete.html
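
A minimal sketch of deleting a core from the command line (the core name "nutch" here is just a placeholder; the factorpad page above covers the other delete variants):
  # run from the Solr install directory; removes the named core and its index
  bin/solr delete -c nutch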


* If you change a Nutch 2 configuration, https://stackoverflow.com/questions/16401667/java-lang-classnotfoundexception-org-apache-gora-hbase-store-hbasestore
explains that you can rebuild Nutch with:
  cd <apache-nutch>
  ant clean
  ant runtime
----------------------------------
Apache Nutch 2 with newer HBase

hbase-common-1.4.8.jar

1. HBase jar files need to go into runtime/local/lib.

But not slf4j-log4j12-1.7.10.jar (there's already a slf4j-log4j12-1.7.5.jar), so remove the 1.7.10 jar from runtime/local/lib after copying the jars over.
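
A rough sketch of step 1, assuming HBase 1.4.8 was unpacked at ~/hbase-1.4.8 and Nutch lives at ~/apache-nutch-2.3.1 (both paths are assumptions):
  cd ~/apache-nutch-2.3.1/runtime/local/lib
  # copy the HBase client jars over (the exact jar set needed may differ)
  cp ~/hbase-1.4.8/lib/hbase-*.jar .
  # avoid two competing slf4j-log4j12 bindings on the classpath
  rm -f slf4j-log4j12-1.7.10.jar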

2. https://stackoverflow.com/questions/46340416/how-to-compile-nutch-2-3-1-with-hbase-1-2-6
   https://stackoverflow.com/questions/39834423/apache-nutch-fetcherjob-throws-nosuchelementexception-deep-in-gora/39837926#39837926

Unfortunately, the page https://paste.apache.org/jjqz referred to above, which contained patches for using Gora 0.7, is no longer available.

http://mail-archives.apache.org/mod_mbox/nutch-user/201602.mbox/%[email protected]%3E

https://www.mail-archive.com/[email protected]/msg14245.html

------------------------------------------------------------------------------
Other way: Nutch on its own vagrant VM with a specified HBase, or Nutch with MongoDB
------------------------------------------------------------------------------
* https://lobster1234.github.io/2017/08/14/search-with-nutch-mongodb-solr/
* https://waue0920.wordpress.com/2016/08/25/nutch-2-3-1-hbase-0-98-hadoop-2-5-solr-4-10-3/

The older but recommended HBase 0.98.21 for Hadoop 2 can be downloaded from https://archive.apache.org/dist/hbase/0.98.21/
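
For example (the exact archive filename is an assumption; check the directory listing at the URL above):
  wget https://archive.apache.org/dist/hbase/0.98.21/hbase-0.98.21-hadoop2-bin.tar.gz
  tar xzf hbase-0.98.21-hadoop2-bin.tar.gz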

-----
HBASE commands
/usr/local/hbase/bin/hbase shell
https://learnhbase.net/2013/03/02/hbase-shell-commands/
http://dwgeek.com/read-hbase-table-using-hbase-shell-get-command.html/
dropping tables: https://www.tutorialspoint.com/hbase/hbase_drop_table.htm

> list

davidbHomePage_webpage is a table

> get 'davidbHomePage_webpage', '1'

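If a crawl's table ever needs to be cleared out, the standard hbase shell sequence (per the tutorialspoint page above) is disable followed by drop:

> disable 'davidbHomePage_webpage'
> drop 'davidbHomePage_webpage'
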
Solution to get a working Nutch 2:
Get http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/hdfs-cc-work/vagrant-for-nutch2.tar.gz
and follow the instructions in my README file in there.

---------------------------------------------------------------------
ALTERNATIVES TO NUTCH - looking for site mirroring capabilities
---------------------------------------------------------------------
=> https://anarc.at/services/archive/web/
   Autistici's crawl [https://git.autistici.org/ale/crawl] needs Go:
   https://medium.com/better-programming/install-go-1-11-on-ubuntu-18-04-16-04-lts-8c098c503c5f
   https://guide.freecodecamp.org/go/installing-go/ubuntu-apt-get/
   To uninstall: https://medium.com/@firebitsbr/how-to-uninstall-from-the-apt-manager-uninstall-just-golang-go-from-universe-debian-ubuntu-82d6a3692cbd
   https://tecadmin.net/install-go-on-ubuntu/ [our vagrant VMs are Ubuntu 16.04 LTS, as discovered by running the cmd "lsb_release -a"]
https://alternativeto.net/software/apache-nutch/
https://alternativeto.net/software/wget/
https://github.com/ArchiveTeam/grab-site/blob/master/README.md#inspecting-warc-files-in-the-terminal
https://github.com/ArchiveTeam/wpull

-------------------

Running nutch 2.x

-------------------

LINKS

https://lucene.472066.n3.nabble.com/Nutch-2-x-readdb-command-dump-td4033937.html
https://cwiki.apache.org/confluence/display/nutch/ReaddbOptions


https://lobster1234.github.io/2017/08/14/search-with-nutch-mongodb-solr/ ## most useful for running nutch 2.x crawls

https://www.mobomo.com/2017/06/the-basics-working-with-nutch-2-x/
134 "Fetch
135
136 This is where the magic happens. During the fetch step, Nutch crawls the urls selected in the generate step. The most important argument you need is -threads: this sets the number of fetcher threads per task. Increasing this will make crawling faster, but setting it too high can overwhelm a site and it might shut out your crawler, as well as take up too much memory from your machine. Run it like this:
137 $ nutch fetch -threads 50"
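
For the Nutch 2.x commands used further below, the equivalent would presumably be (the thread count is just an example; -threads is an option to the 2.x fetch job too):
  ./bin/nutch fetch -all -threads 50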


https://examples.javacodegeeks.com/enterprise-java/apache-hadoop/apache-hadoop-nutch-tutorial/
https://www.yegor256.com/2019/04/17/nutch-from-java.html

http://nutch.sourceforge.net/docs/en/tutorial.html

Intranet: Configuration
To configure things for intranet crawling you must:

  Create a flat file of root urls. For example, to crawl the nutch.org site you might start with a file named urls containing just the Nutch home page. All other Nutch pages should be reachable from this page. The urls file would thus look like:

  http://www.nutch.org/

  Edit the file conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME with the name of the domain you wish to crawl. For example, if you wished to limit the crawl to the nutch.org domain, the line should read:

  +^http://([a-z0-9]*\.)*nutch.org/

  This will include any url in the domain nutch.org.
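
A quick shell sketch of those two configuration steps (file layout as in the old tutorial above; swap in the real domain being mirrored):
  # the flat file of root urls
  echo "http://www.nutch.org/" > urls
  # then edit conf/crawl-urlfilter.txt so the domain filter line reads:
  #   +^http://([a-z0-9]*\.)*nutch.org/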

Intranet: Running the Crawl
Once things are configured, running the crawl is easy. Just use the crawl command. Its options include:

  -dir dir names the directory to put the crawl in.
  -depth depth indicates the link depth from the root page that should be crawled.
  -delay delay determines the number of seconds between accesses to each host.
  -threads threads determines the number of threads that will fetch in parallel.

For example, a typical call might be:

bin/nutch crawl urls -dir crawl.test -depth 3 >& crawl.log

Typically one starts testing one's configuration by crawling at low depths, and watching the output to check that desired pages are found. Once one is more confident of the configuration, then an appropriate depth for a full crawl is around 10. <===========
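
So a follow-up full crawl with the options listed above might look something like this (the depth and thread count are only illustrative):
  bin/nutch crawl urls -dir crawl.full -depth 10 -threads 10 >& crawl-full.log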

Once crawling has completed, one can skip to the Searching section below (i.e. in the original tutorial).


-----------------------------------
Actually running nutch 2.x - steps
-----------------------------------
MANUALLY GOING THROUGH THE CYCLE 3 TIMES:

cd ~/apache-nutch-2.3.1/runtime/local

./bin/nutch inject urls

./bin/nutch generate -topN 50
./bin/nutch fetch -all
./bin/nutch parse -all
./bin/nutch updatedb -all

./bin/nutch generate -topN 50
./bin/nutch fetch -all
./bin/nutch parse -all
./bin/nutch updatedb -all

./bin/nutch generate -topN 50
./bin/nutch fetch -all
./bin/nutch parse -all
./bin/nutch updatedb -all

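The three identical blocks above could equally be run as a small shell loop (a sketch; the topN value and iteration count are just the ones used here):

  cd ~/apache-nutch-2.3.1/runtime/local
  ./bin/nutch inject urls
  for i in 1 2 3; do
      ./bin/nutch generate -topN 50
      ./bin/nutch fetch -all
      ./bin/nutch parse -all
      ./bin/nutch updatedb -all
  done
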
Dump output on local filesystem:
  rm -rf /tmp/bla
  ./bin/nutch readdb -dump /tmp/bla [-crawlId ID -text]
  less /tmp/bla/part-r-00000

To dump output to HDFS:
  Need the hdfs host name if sending/dumping nutch crawl output to a location on hdfs.
  The host is defined in /usr/local/hadoop/etc/hadoop/core-site.xml for property fs.defaultFS (https://stackoverflow.com/questions/27956973/java-io-ioexception-incomplete-hdfs-uri-no-host);
  the host is hdfs://node2/ in this case.
  So:

  hdfs dfs -rmdir /user/vagrant/dump
  XXX ./bin/nutch readdb -dump user/vagrant/dump -text ### won't work
  XXX ./bin/nutch readdb -dump hdfs:///user/vagrant/dump -text ### won't work
  ./bin/nutch readdb -dump hdfs://node2/user/vagrant/dump -text
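
If unsure of the host, the configured fs.defaultFS value can also be read back directly (assuming the hadoop binaries are on the PATH):
  hdfs getconf -confKey fs.defaultFS
  # prints hdfs://node2 (or similar) in this setup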


USING THE SCRIPT TO ATTEMPT TO CRAWL A SITE
* Choosing to repeat the cycle 10 times because, as per http://nutch.sourceforge.net/docs/en/tutorial.html

"Typically one starts testing one's configuration by crawling at low depths, and watching the output to check that desired pages are found. Once one is more confident of the configuration, then an appropriate depth for a full crawl is around 10."

* Use the ./bin/crawl script, providing the seed URLs dir, the crawlId and the number of times to repeat (10):
vagrant@node2:~/apache-nutch-2.3.1/runtime/local$ ./bin/crawl urls davidbHomePage 10


* View the downloaded crawls.
This time we need to provide the crawlId to readdb in order to get a dump of its text contents:
  hdfs dfs -rm -r hdfs://node2/user/vagrant/dump2
  ./bin/nutch readdb -dump hdfs://node2/user/vagrant/dump2 -text -crawlId davidbHomePage

* View the contents:
hdfs dfs -cat hdfs://node2/user/vagrant/dump2/part-r-*


* FIND OUT NUMBER OF URLS DOWNLOADED FOR THE SITE:
vagrant@node2:~/apache-nutch-2.3.1/runtime/local$ ./bin/nutch readdb -stats -crawlId davidbHomePage
WebTable statistics start
Statistics for WebTable:
retry 0: 44
status 5 (status_redir_perm): 4
status 3 (status_gone): 1
status 2 (status_fetched): 39
jobs: {[davidbHomePage]db_stats-job_local647846559_0001={jobName=[davidbHomePage]db_stats, jobID=job_local647846559_0001, counters={Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=135, REDUCE_INPUT_RECORDS=8, SPILLED_RECORDS=16, MERGED_MAP_OUTPUTS=1, VIRTUAL_MEMORY_BYTES=0, MAP_INPUT_RECORDS=44, SPLIT_RAW_BYTES=935, FAILED_SHUFFLE=0, MAP_OUTPUT_BYTES=2332, REDUCE_SHUFFLE_BYTES=135, PHYSICAL_MEMORY_BYTES=0, GC_TIME_MILLIS=0, REDUCE_INPUT_GROUPS=8, COMBINE_OUTPUT_RECORDS=8, SHUFFLED_MAPS=1, REDUCE_OUTPUT_RECORDS=8, MAP_OUTPUT_RECORDS=176, COMBINE_INPUT_RECORDS=176, CPU_MILLISECONDS=0, COMMITTED_HEAP_BYTES=595591168}, File Input Format Counters ={BYTES_READ=0}, File System Counters={FILE_LARGE_READ_OPS=0, FILE_WRITE_OPS=0, FILE_READ_OPS=0, FILE_BYTES_WRITTEN=1788140, FILE_BYTES_READ=1223290}, File Output Format Counters ={BYTES_WRITTEN=275}, Shuffle Errors={CONNECTION=0, WRONG_LENGTH=0, BAD_ID=0, WRONG_MAP=0, WRONG_REDUCE=0, IO_ERROR=0}}}}
TOTAL urls: 44
max score: 1.0
avg score: 0.022727273
min score: 0.0
WebTable statistics: done
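
To pull out just the URL count from that output, something like the following should do (a sketch; -stats prints its summary to the console as shown above):
  ./bin/nutch readdb -stats -crawlId davidbHomePage | grep "TOTAL urls"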

------------------------------------
STOPPING CONDITION
Seems to be built in.
* When I tell it to cycle 15 times, it stops after 6 cycles saying no more URLs to fetch:

vagrant@node2:~/apache-nutch-2.3.1/runtime/local$ ./bin/crawl urls davidbHomePage2 15
---
No SOLRURL specified. Skipping indexing.
Injecting seed URLs

...

Thu Oct 3 09:22:23 UTC 2019 : Iteration 6 of 15
Generating batchId
Generating a new fetchlist
...
Generating batchId
Generating a new fetchlist
/home/vagrant/apache-nutch-2.3.1/runtime/local/bin/nutch generate -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -topN 50000 -noNorm -noFilter -adddays 0 -crawlId davidbHomePage2 -batchId 1570094569-27637
GeneratorJob: starting at 2019-10-03 09:22:49
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: false
GeneratorJob: normalizing: false
GeneratorJob: topN: 50000
GeneratorJob: finished at 2019-10-03 09:22:52, time elapsed: 00:00:02
GeneratorJob: generated batch id: 1570094569-27637 containing 0 URLs
Generate returned 1 (no new segments created)
Escaping loop: no more URLs to fetch now
vagrant@node2:~/apache-nutch-2.3.1/runtime/local$
---

* Running readdb -stats shows 44 URLs fetched, just as the first time (when the crawlId was "davidbHomePage"):

vagrant@node2:~/apache-nutch-2.3.1/runtime/local$ ./bin/nutch readdb -stats -crawlId davidbHomePage2
---
WebTable statistics start
Statistics for WebTable:
retry 0: 44
status 5 (status_redir_perm): 4
status 3 (status_gone): 1
status 2 (status_fetched): 39
jobs: {[davidbHomePage2]db_stats-job_local985519583_0001={jobName=[davidbHomePage2]db_stats, jobID=job_local985519583_0001, counters={Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=135, REDUCE_INPUT_RECORDS=8, SPILLED_RECORDS=16, MERGED_MAP_OUTPUTS=1, VIRTUAL_MEMORY_BYTES=0, MAP_INPUT_RECORDS=44, SPLIT_RAW_BYTES=935, FAILED_SHUFFLE=0, MAP_OUTPUT_BYTES=2332, REDUCE_SHUFFLE_BYTES=135, PHYSICAL_MEMORY_BYTES=0, GC_TIME_MILLIS=4, REDUCE_INPUT_GROUPS=8, COMBINE_OUTPUT_RECORDS=8, SHUFFLED_MAPS=1, REDUCE_OUTPUT_RECORDS=8, MAP_OUTPUT_RECORDS=176, COMBINE_INPUT_RECORDS=176, CPU_MILLISECONDS=0, COMMITTED_HEAP_BYTES=552599552}, File Input Format Counters ={BYTES_READ=0}, File System Counters={FILE_LARGE_READ_OPS=0, FILE_WRITE_OPS=0, FILE_READ_OPS=0, FILE_BYTES_WRITTEN=1788152, FILE_BYTES_READ=1223290}, File Output Format Counters ={BYTES_WRITTEN=275}, Shuffle Errors={CONNECTION=0, WRONG_LENGTH=0, BAD_ID=0, WRONG_MAP=0, WRONG_REDUCE=0, IO_ERROR=0}}}}
TOTAL urls: 44
---

----------------------------------------------------------------------
  Testing URLFilters: testing a URL to see if it's accepted
----------------------------------------------------------------------
Use the command
  ./bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
(mentioned at https://lucene.472066.n3.nabble.com/Correct-syntax-for-regex-urlfilter-txt-trying-to-exclude-single-path-results-td3600376.html)

Use as follows:

  cd apache-nutch-2.3.1/runtime/local

  ./bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined

Then paste the URL you want to test and press Enter.
  A + in front of the response means the URL was accepted.
  A - in front of the response means it was rejected.
You can continue pasting URLs to test against the filters until you send Ctrl-D to terminate input.
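
Since the checker reads from stdin, a whole file of test URLs can presumably be piped through it in one go (test-urls.txt is a hypothetical file with one URL per line):
  cat test-urls.txt | ./bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined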