Changeset 33545

Timestamp:
03.10.2019 22:38:00
Author:
ak19
Message:

Mainly changes to crawling-Nutch.txt, plus some minor changes to other txt files. crawling-Nutch.txt now documents my attempts to run nutch v2 successfully on the davidb homepage site, crawl it entirely, and dump the text output into the local or hadoop filesystem. I also ran the nutch cycle (generate-fetch-parse-updatedb) with two different repeat counts to download the site: 10 cycles and 15 cycles. The second time I paid attention to the output: it stopped after 6 cycles, saying there was nothing new to fetch. So nutch seems to have a built-in termination test, which allows site mirroring. Running readdb with the -stats flag let me check that both times it downloaded 44 URLs.

Location:
gs3-extensions/maori-lang-detection
Files:
3 modified

  • gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt

    r33499 r33545  
    33https://www.guru99.com/create-your-first-hadoop-program.html 
    44 
     5Some Hadoop commands 
     6* https://community.cloudera.com/t5/Support-Questions/Closed-How-to-store-output-of-shell-script-in-HDFS/td-p/229933 
     7* https://stackoverflow.com/questions/26513861/checking-if-directory-in-hdfs-already-exists-or-not 
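
     A minimal sketch of both of the above, assuming the hadoop binaries are on the PATH (the script name and paths here are just placeholders):

         # check whether a directory already exists in hdfs; -test -d only sets the exit code
         hdfs dfs -test -d /user/vagrant/dump && echo "exists" || echo "does not exist"

         # store the output of a shell script in hdfs by piping it to -put ("-" reads from stdin)
         ./myscript.sh | hdfs dfs -put - /user/vagrant/myscript-output.txt
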
    58-------------- 
    69To run firefox/anything graphical inside the VM run by vagrant, have to ssh -Y onto both analytics and then to the vagrant VM from analytics: 
  • gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt

    r33541 r33545  
    7676/usr/local/hbase/bin/hbase shell 
    7777https://learnhbase.net/2013/03/02/hbase-shell-commands/ 
    78  
    79  
    80 list 
     78http://dwgeek.com/read-hbase-table-using-hbase-shell-get-command.html/ 
     79 
     80> list 
    8181 
    8282davidbHomePage_webpage is a table 
    8383 
     84> get 'davidbHomePage_webpage', '1' 
    8485 
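     A few more hbase shell commands that may be useful for inspecting the same table (a sketch; only the 'list' and 'get' above were actually run):

         > describe 'davidbHomePage_webpage'
         > count 'davidbHomePage_webpage'
         > scan 'davidbHomePage_webpage', {LIMIT => 2}
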
    8586Solution to get a working nutch2: 
     
    101102https://github.com/ArchiveTeam/wpull 
    102103 
     104------------------- 
     105 
     106Running nutch 2.x 
     107 
     108------------------- 
     109 
     110LINKS 
     111 
     112https://lucene.472066.n3.nabble.com/Nutch-2-x-readdb-command-dump-td4033937.html 
     113https://cwiki.apache.org/confluence/display/nutch/ReaddbOptions 
     114 
     115 
     116https://lobster1234.github.io/2017/08/14/search-with-nutch-mongodb-solr/ ## most useful for running nutch 2.x crawls 
     117 
     118https://www.mobomo.com/2017/06/the-basics-working-with-nutch-2-x/ 
     119    "Fetch 
     120 
     121    This is where the magic happens.  During the fetch step, Nutch crawls the urls selected in the generate step.  The most important argument you need is -threads: this sets the number of fetcher threads per task.  Increasing this will make crawling faster, but setting it too high can overwhelm a site and it might shut out your crawler, as well as take up too much memory from your machine.  Run it like this: 
     122    $ nutch fetch -threads 50" 
     123 
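     In the nutch 2.x setup used further below, the quoted -threads option can presumably be combined with -all and -crawlId; an unverified sketch:

         ./bin/nutch fetch -all -threads 10 -crawlId davidbHomePage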
     124 
     125https://examples.javacodegeeks.com/enterprise-java/apache-hadoop/apache-hadoop-nutch-tutorial/ 
     126https://www.yegor256.com/2019/04/17/nutch-from-java.html 
     127 
     128http://nutch.sourceforge.net/docs/en/tutorial.html 
     129 
     130Intranet: Configuration 
     131To configure things for intranet crawling you must: 
     132 
     133    Create a flat file of root urls. For example, to crawl the nutch.org site you might start with a file named urls containing just the Nutch home page. All other Nutch pages should be reachable from this page. The urls file would thus look like: 
     134 
     135    http://www.nutch.org/ 
     136 
     137    Edit the file conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME with the name of the domain you wish to crawl. For example, if you wished to limit the crawl to the nutch.org domain, the line should read: 
     138 
     139    +^http://([a-z0-9]*\.)*nutch.org/ 
     140 
     141    This will include any url in the domain nutch.org. 
     142 
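     For the nutch 2.3.1 install used below, the equivalent setup would look something like this sketch, reusing the tutorial's nutch.org example (nutch 2.x appears to read conf/regex-urlfilter.txt rather than conf/crawl-urlfilter.txt):

         # seed list: a directory of flat files containing the root urls
         mkdir -p urls
         echo "http://www.nutch.org/" > urls/seed.txt

         # line added to conf/regex-urlfilter.txt to limit the crawl to the nutch.org domain
         +^http://([a-z0-9]*\.)*nutch.org/
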
     143Intranet: Running the Crawl 
     144Once things are configured, running the crawl is easy. Just use the crawl command. Its options include: 
     145 
     146    -dir dir names the directory to put the crawl in. 
     147    -depth depth indicates the link depth from the root page that should be crawled. 
     148    -delay delay determines the number of seconds between accesses to each host. 
     149    -threads threads determines the number of threads that will fetch in parallel. 
     150 
     151For example, a typical call might be: 
     152 
     153bin/nutch crawl urls -dir crawl.test -depth 3 >& crawl.log 
     154 
     155Typically one starts testing one's configuration by crawling at low depths, and watching the output to check that desired pages are found. Once one is more confident of the configuration, then an appropriate depth for a full crawl is around 10. <=========== 
     156 
     157Once crawling has completed, one can skip to the Searching section below. 
     158 
     159 
     160----------------------------------- 
     161Actually running nutch 2.x - steps 
     162----------------------------------- 
     163MANUALLY GOING THROUGH THE CYCLE 3 TIMES: 
     164 
     165cd ~/apache-nutch-2.3.1/runtime/local 
     166 
     167./bin/nutch inject urls 
     168 
     169./bin/nutch generate -topN 50 
     170./bin/nutch fetch -all 
     171./bin/nutch parse -all 
     172./bin/nutch updatedb -all 
     173 
     174./bin/nutch generate -topN 50 
     175./bin/nutch fetch -all 
     176./bin/nutch parse -all 
     177./bin/nutch updatedb -all 
     178 
     179./bin/nutch generate -topN 50 
     180./bin/nutch fetch -all 
     181./bin/nutch parse -all 
     182./bin/nutch updatedb -all 
     183 
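     The same three manual cycles can be written as a loop; a sketch equivalent to the commands above:

         cd ~/apache-nutch-2.3.1/runtime/local
         ./bin/nutch inject urls
         # repeat the generate-fetch-parse-updatedb cycle 3 times
         for i in 1 2 3; do
             ./bin/nutch generate -topN 50
             ./bin/nutch fetch -all
             ./bin/nutch parse -all
             ./bin/nutch updatedb -all
         done
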
     184Dump output on local filesystem: 
     185    rm -rf /tmp/bla 
     186    ./bin/nutch readdb -dump /tmp/bla 
     187    less /tmp/bla/part-r-00000 
     188 
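     The -text flag used for the hdfs dumps below should also work for a local dump if readable page text is wanted; an untested sketch (the /tmp/bla2 path is just a placeholder):

         rm -rf /tmp/bla2
         ./bin/nutch readdb -dump /tmp/bla2 -text
         less /tmp/bla2/part-r-00000
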
      189To dump output on the hadoop filesystem (hdfs): 
      190   The hdfs host name is needed when sending/dumping the nutch crawl output to a location on hdfs. 
      191   The host is defined in /usr/local/hadoop/etc/hadoop/core-site.xml in the property fs.defaultFS (https://stackoverflow.com/questions/27956973/java-io-ioexception-incomplete-hdfs-uri-no-host); 
      192   the host is hdfs://node2/ in this case. 
      193   So: 
     194 
     195    hdfs dfs -rmdir /user/vagrant/dump 
     196    XXX ./bin/nutch readdb -dump user/vagrant/dump -text ### won't work 
     197    XXX ./bin/nutch readdb -dump hdfs:///user/vagrant/dump -text ### won't work 
     198    ./bin/nutch readdb -dump hdfs://node2/user/vagrant/dump -text 
     199 
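     To look up the fs.defaultFS value without opening core-site.xml, something like this should also work (a sketch):

         hdfs getconf -confKey fs.defaultFS
         # should print the configured default filesystem for this setup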
     200 
     201USING THE SCRIPT TO ATTEMPT TO CRAWL A SITE 
     202* Choosing to repeat the cycle 10 times because, as per http://nutch.sourceforge.net/docs/en/tutorial.html 
     203 
     204"Typically one starts testing one's configuration by crawling at low depths, and watching the output to check that desired pages are found. Once one is more confident of the configuration, then an appropriate depth for a full crawl is around 10."  
     205 
      206* Use the ./bin/crawl script: provide the seed urls dir, the crawlId, and the number of times to repeat the cycle (10): 
     207vagrant@node2:~/apache-nutch-2.3.1/runtime/local$ ./bin/crawl urls davidbHomePage 10 
     208 
     209 
     210* View the downloaded crawls. 
      211This time the crawlId needs to be provided to readdb in order to get a dump of its text contents: 
     212   hdfs dfs -rm -r hdfs://node2/user/vagrant/dump2 
     213   ./bin/nutch readdb -dump hdfs://node2/user/vagrant/dump2 -text -crawlId davidbHomePage 
     214 
     215* View the contents: 
     216hdfs dfs -cat hdfs://node2/user/vagrant/dump2/part-r-* 
     217 
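     If a local copy of the dump is wanted (e.g. to page through with less), it can be pulled out of hdfs with -get; a sketch (the /tmp/dump2 path is just a placeholder):

         hdfs dfs -get hdfs://node2/user/vagrant/dump2 /tmp/dump2
         less /tmp/dump2/part-r-00000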
     218 
     219* FIND OUT NUMBER OF URLS DOWNLOADED FOR THE SITE: 
     220vagrant@node2:~/apache-nutch-2.3.1/runtime/local$ ./bin/nutch readdb -stats -crawlId davidbHomePage 
     221WebTable statistics start 
     222Statistics for WebTable:  
     223retry 0:    44 
     224status 5 (status_redir_perm):   4 
     225status 3 (status_gone): 1 
     226status 2 (status_fetched):  39 
     227jobs:   {[davidbHomePage]db_stats-job_local647846559_0001={jobName=[davidbHomePage]db_stats, jobID=job_local647846559_0001, counters={Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=135, REDUCE_INPUT_RECORDS=8, SPILLED_RECORDS=16, MERGED_MAP_OUTPUTS=1, VIRTUAL_MEMORY_BYTES=0, MAP_INPUT_RECORDS=44, SPLIT_RAW_BYTES=935, FAILED_SHUFFLE=0, MAP_OUTPUT_BYTES=2332, REDUCE_SHUFFLE_BYTES=135, PHYSICAL_MEMORY_BYTES=0, GC_TIME_MILLIS=0, REDUCE_INPUT_GROUPS=8, COMBINE_OUTPUT_RECORDS=8, SHUFFLED_MAPS=1, REDUCE_OUTPUT_RECORDS=8, MAP_OUTPUT_RECORDS=176, COMBINE_INPUT_RECORDS=176, CPU_MILLISECONDS=0, COMMITTED_HEAP_BYTES=595591168}, File Input Format Counters ={BYTES_READ=0}, File System Counters={FILE_LARGE_READ_OPS=0, FILE_WRITE_OPS=0, FILE_READ_OPS=0, FILE_BYTES_WRITTEN=1788140, FILE_BYTES_READ=1223290}, File Output Format Counters ={BYTES_WRITTEN=275}, Shuffle Errors={CONNECTION=0, WRONG_LENGTH=0, BAD_ID=0, WRONG_MAP=0, WRONG_REDUCE=0, IO_ERROR=0}}}} 
     228TOTAL urls: 44 
     229max score:  1.0 
     230avg score:  0.022727273 
     231min score:  0.0 
     232WebTable statistics: done 
     233 
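     To check just the URL count (the 44 referred to in the commit message), the stats output can be filtered; a sketch, assuming the stats lines end up on the console:

         ./bin/nutch readdb -stats -crawlId davidbHomePage 2>&1 | grep "TOTAL urls"
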
     234------------------------------------ 
     235STOPPING CONDITION 
      236Seems to be built in. 
      237* When I told it to cycle 15 times, it stopped after 6 cycles, saying there were no more URLs to fetch: 
     238 
     239vagrant@node2:~/apache-nutch-2.3.1/runtime/local$ ./bin/crawl urls davidbHomePage2 15 
     240--- 
     241No SOLRURL specified. Skipping indexing. 
     242Injecting seed URLs 
     243 
     244... 
     245 
     246Thu Oct 3 09:22:23 UTC 2019 : Iteration 6 of 15 
     247Generating batchId 
     248Generating a new fetchlist 
     249... 
     250Generating batchId 
     251Generating a new fetchlist 
     252/home/vagrant/apache-nutch-2.3.1/runtime/local/bin/nutch generate -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -topN 50000 -noNorm -noFilter -adddays 0 -crawlId davidbHomePage2 -batchId 1570094569-27637 
     253GeneratorJob: starting at 2019-10-03 09:22:49 
     254GeneratorJob: Selecting best-scoring urls due for fetch. 
     255GeneratorJob: starting 
     256GeneratorJob: filtering: false 
     257GeneratorJob: normalizing: false 
     258GeneratorJob: topN: 50000 
     259GeneratorJob: finished at 2019-10-03 09:22:52, time elapsed: 00:00:02 
     260GeneratorJob: generated batch id: 1570094569-27637 containing 0 URLs 
     261Generate returned 1 (no new segments created) 
     262Escaping loop: no more URLs to fetch now 
     263vagrant@node2:~/apache-nutch-2.3.1/runtime/local$  
     264--- 
     265 
      266* Running readdb -stats shows 44 URLs fetched, just as the first time (when the crawlId had been "davidbHomePage"): 
     267 
     268vagrant@node2:~/apache-nutch-2.3.1/runtime/local$ ./bin/nutch readdb -stats -crawlId davidbHomePage2 
     269--- 
     270WebTable statistics start 
     271Statistics for WebTable:  
     272retry 0:    44 
     273status 5 (status_redir_perm):   4 
     274status 3 (status_gone): 1 
     275status 2 (status_fetched):  39 
     276jobs:   {[davidbHomePage2]db_stats-job_local985519583_0001={jobName=[davidbHomePage2]db_stats, jobID=job_local985519583_0001, counters={Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=135, REDUCE_INPUT_RECORDS=8, SPILLED_RECORDS=16, MERGED_MAP_OUTPUTS=1, VIRTUAL_MEMORY_BYTES=0, MAP_INPUT_RECORDS=44, SPLIT_RAW_BYTES=935, FAILED_SHUFFLE=0, MAP_OUTPUT_BYTES=2332, REDUCE_SHUFFLE_BYTES=135, PHYSICAL_MEMORY_BYTES=0, GC_TIME_MILLIS=4, REDUCE_INPUT_GROUPS=8, COMBINE_OUTPUT_RECORDS=8, SHUFFLED_MAPS=1, REDUCE_OUTPUT_RECORDS=8, MAP_OUTPUT_RECORDS=176, COMBINE_INPUT_RECORDS=176, CPU_MILLISECONDS=0, COMMITTED_HEAP_BYTES=552599552}, File Input Format Counters ={BYTES_READ=0}, File System Counters={FILE_LARGE_READ_OPS=0, FILE_WRITE_OPS=0, FILE_READ_OPS=0, FILE_BYTES_WRITTEN=1788152, FILE_BYTES_READ=1223290}, File Output Format Counters ={BYTES_WRITTEN=275}, Shuffle Errors={CONNECTION=0, WRONG_LENGTH=0, BAD_ID=0, WRONG_MAP=0, WRONG_REDUCE=0, IO_ERROR=0}}}} 
     277TOTAL urls: 44 
     278--- 
     279 
  • gs3-extensions/maori-lang-detection/hdfs-cc-work/GS_README.TXT

    r33543 r33545  
    4334334. Since trying to go install the crawl url didn't work 
    434434https://stackoverflow.com/questions/14416275/error-cant-load-package-package-my-prog-found-packages-my-prog-and-main 
     435[https://stackoverflow.com/questions/26694271/go-install-doesnt-create-any-bin-file] 
    435436 
    436437vagrant@node2:~/go/src$