Timestamp:
2019-10-03T22:38:00+13:00
Author:
ak19
Message:

Mainly changes to crawling-Nutch.txt and some minor changes to other txt files. crawling-Nutch.txt now documents my attempts to successfully run nutch v2 on the davidb homepage site, crawl it entirely, and dump the text output into the local or hadoop filesystem. I also ran 2 different numbers of nutch cycles (generate-fetch-parse-updatedb) to download the site: 10 cycles and 15 cycles. I paid attention to the output the second time: it stopped after 6 cycles, saying there was nothing new to fetch. So it seems to have a built-in termination test, allowing site mirroring. Running readdb with the -stats flag allowed me to check that both times it downloaded 44 URLs.

File:
1 edited

  • gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt

r33541 → r33545

/usr/local/hbase/bin/hbase shell
https://learnhbase.net/2013/03/02/hbase-shell-commands/
http://dwgeek.com/read-hbase-table-using-hbase-shell-get-command.html/

> list

davidbHomePage_webpage is a table

> get 'davidbHomePage_webpage', '1'
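
(A couple of extra hbase shell probes that should also work here, just to peek at what nutch stored -- not run as part of these notes, so treat them as a sketch rather than verified output:)

> count 'davidbHomePage_webpage'
> scan 'davidbHomePage_webpage', {LIMIT => 2}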

Solution to get a working nutch2:
...
https://github.com/ArchiveTeam/wpull

-------------------

Running nutch 2.x

-------------------

LINKS

https://lucene.472066.n3.nabble.com/Nutch-2-x-readdb-command-dump-td4033937.html
https://cwiki.apache.org/confluence/display/nutch/ReaddbOptions


https://lobster1234.github.io/2017/08/14/search-with-nutch-mongodb-solr/ ## most useful for running nutch 2.x crawls

https://www.mobomo.com/2017/06/the-basics-working-with-nutch-2-x/
    "Fetch

    This is where the magic happens.  During the fetch step, Nutch crawls the urls selected in the generate step.  The most important argument you need is -threads: this sets the number of fetcher threads per task.  Increasing this will make crawling faster, but setting it too high can overwhelm a site and it might shut out your crawler, as well as take up too much memory from your machine.  Run it like this:
    $ nutch fetch -threads 50"
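
    (Side note on the above: in the nutch 2.x cycle further down, fetch is run as "fetch -all"; if the thread count matters, the two can presumably be combined, e.g.
        ./bin/nutch fetch -all -threads 50
    -- not tried here, so treat that flag combination as an assumption.)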


https://examples.javacodegeeks.com/enterprise-java/apache-hadoop/apache-hadoop-nutch-tutorial/
https://www.yegor256.com/2019/04/17/nutch-from-java.html

http://nutch.sourceforge.net/docs/en/tutorial.html

Intranet: Configuration
To configure things for intranet crawling you must:

    Create a flat file of root urls. For example, to crawl the nutch.org site you might start with a file named urls containing just the Nutch home page. All other Nutch pages should be reachable from this page. The urls file would thus look like:

    http://www.nutch.org/

    Edit the file conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME with the name of the domain you wish to crawl. For example, if you wished to limit the crawl to the nutch.org domain, the line should read:

    +^http://([a-z0-9]*\.)*nutch.org/

    This will include any url in the domain nutch.org.
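
(Adapting those two steps to this setup would look roughly like the sketch below. The domain is left as the tutorial's MY.DOMAIN.NAME placeholder rather than the real davidb homepage URL, and in nutch 2.3.1 the filter file appears to be conf/regex-urlfilter.txt rather than conf/crawl-urlfilter.txt:)

    mkdir -p urls
    echo "http://MY.DOMAIN.NAME/" > urls/seed.txt
    # then add this line to the url filter file to restrict the crawl to that domain:
    # +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/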

Intranet: Running the Crawl
Once things are configured, running the crawl is easy. Just use the crawl command. Its options include:

    -dir dir names the directory to put the crawl in.
    -depth depth indicates the link depth from the root page that should be crawled.
    -delay delay determines the number of seconds between accesses to each host.
    -threads threads determines the number of threads that will fetch in parallel.

For example, a typical call might be:

bin/nutch crawl urls -dir crawl.test -depth 3 >& crawl.log

Typically one starts testing one's configuration by crawling at low depths, and watching the output to check that desired pages are found. Once one is more confident of the configuration, then an appropriate depth for a full crawl is around 10. <===========

Once crawling has completed, one can skip to the Searching section below.


-----------------------------------
Actually running nutch 2.x - steps
-----------------------------------
MANUALLY GOING THROUGH THE CYCLE 3 TIMES:

cd ~/apache-nutch-2.3.1/runtime/local

./bin/nutch inject urls

./bin/nutch generate -topN 50
./bin/nutch fetch -all
./bin/nutch parse -all
./bin/nutch updatedb -all

./bin/nutch generate -topN 50
./bin/nutch fetch -all
./bin/nutch parse -all
./bin/nutch updatedb -all

./bin/nutch generate -topN 50
./bin/nutch fetch -all
./bin/nutch parse -all
./bin/nutch updatedb -all

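(The same three rounds, wrapped in a shell loop -- just a convenience sketch of the commands above, nothing new:)

    cd ~/apache-nutch-2.3.1/runtime/local
    ./bin/nutch inject urls
    for i in 1 2 3; do
        ./bin/nutch generate -topN 50
        ./bin/nutch fetch -all
        ./bin/nutch parse -all
        ./bin/nutch updatedb -all
    done
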
Dump output on local filesystem:
    rm -rf /tmp/bla
    ./bin/nutch readdb -dump /tmp/bla
    less /tmp/bla/part-r-00000

To dump output on the hadoop filesystem (hdfs):
   Need the hdfs host name if sending/dumping nutch crawl output to a location on hdfs.
   The host is defined in /usr/local/hadoop/etc/hadoop/core-site.xml for property fs.defaultFS (https://stackoverflow.com/questions/27956973/java-io-ioexception-incomplete-hdfs-uri-no-host);
   the host is hdfs://node2/ in this case.
   So:

    hdfs dfs -rmdir /user/vagrant/dump
    XXX ./bin/nutch readdb -dump user/vagrant/dump -text ### won't work
    XXX ./bin/nutch readdb -dump hdfs:///user/vagrant/dump -text ### won't work
    ./bin/nutch readdb -dump hdfs://node2/user/vagrant/dump -text

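(To double-check the fs.defaultFS value without opening core-site.xml, this should also work, assuming the hadoop client tools are on the PATH:)

    hdfs getconf -confKey fs.defaultFS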

USING THE SCRIPT TO ATTEMPT TO CRAWL A SITE
* Choosing to repeat the cycle 10 times because, as per http://nutch.sourceforge.net/docs/en/tutorial.html

"Typically one starts testing one's configuration by crawling at low depths, and watching the output to check that desired pages are found. Once one is more confident of the configuration, then an appropriate depth for a full crawl is around 10."

* Use the ./bin/crawl script, provide the seed urls dir, the crawlId and number of times to repeat = 10
vagrant@node2:~/apache-nutch-2.3.1/runtime/local$ ./bin/crawl urls davidbHomePage 10


* View the downloaded crawls.
This time need to provide crawlId to readdb, in order to get a dump of its text contents:
   hdfs dfs -rm -r hdfs://node2/user/vagrant/dump2
   ./bin/nutch readdb -dump hdfs://node2/user/vagrant/dump2 -text -crawlId davidbHomePage

* View the contents:
hdfs dfs -cat hdfs://node2/user/vagrant/dump2/part-r-*
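
(If paging through it locally is easier, the dump can also be copied out of hdfs first -- a small convenience sketch, assuming enough local disk space:)
hdfs dfs -get hdfs://node2/user/vagrant/dump2 /tmp/dump2
less /tmp/dump2/part-r-00000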


* FIND OUT NUMBER OF URLS DOWNLOADED FOR THE SITE:
vagrant@node2:~/apache-nutch-2.3.1/runtime/local$ ./bin/nutch readdb -stats -crawlId davidbHomePage
WebTable statistics start
Statistics for WebTable:
retry 0:    44
status 5 (status_redir_perm):   4
status 3 (status_gone): 1
status 2 (status_fetched):  39
jobs:   {[davidbHomePage]db_stats-job_local647846559_0001={jobName=[davidbHomePage]db_stats, jobID=job_local647846559_0001, counters={Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=135, REDUCE_INPUT_RECORDS=8, SPILLED_RECORDS=16, MERGED_MAP_OUTPUTS=1, VIRTUAL_MEMORY_BYTES=0, MAP_INPUT_RECORDS=44, SPLIT_RAW_BYTES=935, FAILED_SHUFFLE=0, MAP_OUTPUT_BYTES=2332, REDUCE_SHUFFLE_BYTES=135, PHYSICAL_MEMORY_BYTES=0, GC_TIME_MILLIS=0, REDUCE_INPUT_GROUPS=8, COMBINE_OUTPUT_RECORDS=8, SHUFFLED_MAPS=1, REDUCE_OUTPUT_RECORDS=8, MAP_OUTPUT_RECORDS=176, COMBINE_INPUT_RECORDS=176, CPU_MILLISECONDS=0, COMMITTED_HEAP_BYTES=595591168}, File Input Format Counters ={BYTES_READ=0}, File System Counters={FILE_LARGE_READ_OPS=0, FILE_WRITE_OPS=0, FILE_READ_OPS=0, FILE_BYTES_WRITTEN=1788140, FILE_BYTES_READ=1223290}, File Output Format Counters ={BYTES_WRITTEN=275}, Shuffle Errors={CONNECTION=0, WRONG_LENGTH=0, BAD_ID=0, WRONG_MAP=0, WRONG_REDUCE=0, IO_ERROR=0}}}}
TOTAL urls: 44
max score:  1.0
avg score:  0.022727273
min score:  0.0
WebTable statistics: done
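
(To pull out just that count on future runs -- plain grep over the same stats output, nothing nutch-specific:)
./bin/nutch readdb -stats -crawlId davidbHomePage | grep "TOTAL urls"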

------------------------------------
STOPPING CONDITION
Seems inbuilt
* When I tell it to cycle 15 times, it stops after 6 cycles saying no more URLs to fetch:

vagrant@node2:~/apache-nutch-2.3.1/runtime/local$ ./bin/crawl urls davidbHomePage2 15
---
No SOLRURL specified. Skipping indexing.
Injecting seed URLs

...

Thu Oct 3 09:22:23 UTC 2019 : Iteration 6 of 15
Generating batchId
Generating a new fetchlist
...
Generating batchId
Generating a new fetchlist
/home/vagrant/apache-nutch-2.3.1/runtime/local/bin/nutch generate -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -topN 50000 -noNorm -noFilter -adddays 0 -crawlId davidbHomePage2 -batchId 1570094569-27637
GeneratorJob: starting at 2019-10-03 09:22:49
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: false
GeneratorJob: normalizing: false
GeneratorJob: topN: 50000
GeneratorJob: finished at 2019-10-03 09:22:52, time elapsed: 00:00:02
GeneratorJob: generated batch id: 1570094569-27637 containing 0 URLs
Generate returned 1 (no new segments created)
Escaping loop: no more URLs to fetch now
vagrant@node2:~/apache-nutch-2.3.1/runtime/local$
---

* Running readdb -stats shows 44 URLs fetched, just as the first time (when the crawlId had been "davidbHomePage"):

vagrant@node2:~/apache-nutch-2.3.1/runtime/local$ ./bin/nutch readdb -stats -crawlId davidbHomePage2
---
WebTable statistics start
Statistics for WebTable:
retry 0:    44
status 5 (status_redir_perm):   4
status 3 (status_gone): 1
status 2 (status_fetched):  39
jobs:   {[davidbHomePage2]db_stats-job_local985519583_0001={jobName=[davidbHomePage2]db_stats, jobID=job_local985519583_0001, counters={Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=135, REDUCE_INPUT_RECORDS=8, SPILLED_RECORDS=16, MERGED_MAP_OUTPUTS=1, VIRTUAL_MEMORY_BYTES=0, MAP_INPUT_RECORDS=44, SPLIT_RAW_BYTES=935, FAILED_SHUFFLE=0, MAP_OUTPUT_BYTES=2332, REDUCE_SHUFFLE_BYTES=135, PHYSICAL_MEMORY_BYTES=0, GC_TIME_MILLIS=4, REDUCE_INPUT_GROUPS=8, COMBINE_OUTPUT_RECORDS=8, SHUFFLED_MAPS=1, REDUCE_OUTPUT_RECORDS=8, MAP_OUTPUT_RECORDS=176, COMBINE_INPUT_RECORDS=176, CPU_MILLISECONDS=0, COMMITTED_HEAP_BYTES=552599552}, File Input Format Counters ={BYTES_READ=0}, File System Counters={FILE_LARGE_READ_OPS=0, FILE_WRITE_OPS=0, FILE_READ_OPS=0, FILE_BYTES_WRITTEN=1788152, FILE_BYTES_READ=1223290}, File Output Format Counters ={BYTES_WRITTEN=275}, Shuffle Errors={CONNECTION=0, WRONG_LENGTH=0, BAD_ID=0, WRONG_MAP=0, WRONG_REDUCE=0, IO_ERROR=0}}}}
TOTAL urls: 44
---

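* (To check that the two crawls fetched the same pages and not just the same count, their text dumps could be diffed -- a rough sketch: "dump3" is just a made-up directory name for the second crawl's dump, and since the dumped records include fetch metadata the diff will be noisy, so it's the URL lines that are worth comparing:)

./bin/nutch readdb -dump hdfs://node2/user/vagrant/dump3 -text -crawlId davidbHomePage2
hdfs dfs -cat hdfs://node2/user/vagrant/dump2/part-r-* > /tmp/davidbHomePage.dump.txt
hdfs dfs -cat hdfs://node2/user/vagrant/dump3/part-r-* > /tmp/davidbHomePage2.dump.txt
diff /tmp/davidbHomePage.dump.txt /tmp/davidbHomePage2.dump.txt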