Changeset 33545

Timestamp:
03.10.2019 22:38:00
Author:
ak19
Message:

Mainly changes to crawling-Nutch.txt, plus some minor changes to other txt files. crawling-Nutch.txt now documents my attempts to run nutch v2 successfully on the davidb homepage site, crawl it entirely, and dump the text output into the local or hadoop filesystem. I also ran the nutch cycle (generate-fetch-parse-updatedb) with two different repeat counts to download the site: 10 cycles and 15 cycles. The second time I paid attention to the output: it stopped after 6 cycles, saying there was nothing new to fetch. So nutch seems to have a built-in termination test, which allows site mirroring. Running readdb with the -stats flag let me check that both times it downloaded 44 URLs.

Location:
gs3-extensions/maori-lang-detection
Files:
3 modified

  • gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt

    r33499 r33545  
    33https://www.guru99.com/create-your-first-hadoop-program.html 
    44 
     5Some Hadoop commands 
     6* https://community.cloudera.com/t5/Support-Questions/Closed-How-to-store-output-of-shell-script-in-HDFS/td-p/229933 
     7* https://stackoverflow.com/questions/26513861/checking-if-directory-in-hdfs-already-exists-or-not 
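
     A minimal sketch of both of the above, assuming the hadoop binaries are on the PATH (the script name and paths here are just placeholders):

         # check whether a directory already exists in hdfs; -test -d only sets the exit code
         hdfs dfs -test -d /user/vagrant/dump && echo "exists" || echo "does not exist"

         # store the output of a shell script in hdfs by piping it to -put ("-" reads from stdin)
         ./myscript.sh | hdfs dfs -put - /user/vagrant/myscript-output.txt
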
    58-------------- 
    69To run firefox/anything graphical inside the VM run by vagrant, have to ssh -Y onto both analytics and then to the vagrant VM from analytics: 
  • gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt

    r33541 r33545  
    7676/usr/local/hbase/bin/hbase shell 
    7777https://learnhbase.net/2013/03/02/hbase-shell-commands/ 
    78  
    79  
    80 list 
     78http://dwgeek.com/read-hbase-table-using-hbase-shell-get-command.html/ 
     79 
     80> list 
    8181 
    8282davidbHomePage_webpage is a table 
    8383 
     84> get 'davidbHomePage_webpage', '1' 
    8485 
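     A few more hbase shell commands that may be useful for inspecting the same table (a sketch; only the 'list' and 'get' above were actually run):

         > describe 'davidbHomePage_webpage'
         > count 'davidbHomePage_webpage'
         > scan 'davidbHomePage_webpage', {LIMIT => 2}
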
    8586Solution to get a working nutch2: 
     
    101102https://github.com/ArchiveTeam/wpull 
    102103 
     104------------------- 
     105 
     106Running nutch 2.x 
     107 
     108------------------- 
     109 
     110LINKS 
     111 
     112https://lucene.472066.n3.nabble.com/Nutch-2-x-readdb-command-dump-td4033937.html 
     113https://cwiki.apache.org/confluence/display/nutch/ReaddbOptions 
     114 
     115 
     116https://lobster1234.github.io/2017/08/14/search-with-nutch-mongodb-solr/ ## most useful for running nutch 2.x crawls 
     117 
     118https://www.mobomo.com/2017/06/the-basics-working-with-nutch-2-x/ 
     119    "Fetch 
     120 
     121    This is where the magic happens.  During the fetch step, Nutch crawls the urls selected in the generate step.  The most important argument you need is -threads: this sets the number of fetcher threads per task.  Increasing this will make crawling faster, but setting it too high can overwhelm a site and it might shut out your crawler, as well as take up too much memory from your machine.  Run it like this: 
     122    $ nutch fetch -threads 50" 
     123 
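     In the nutch 2.x setup used further below, the quoted -threads option can presumably be combined with -all and -crawlId; an unverified sketch:

         ./bin/nutch fetch -all -threads 10 -crawlId davidbHomePage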
     124 
     125https://examples.javacodegeeks.com/enterprise-java/apache-hadoop/apache-hadoop-nutch-tutorial/ 
     126https://www.yegor256.com/2019/04/17/nutch-from-java.html 
     127 
     128http://nutch.sourceforge.net/docs/en/tutorial.html 
     129 
     130Intranet: Configuration 
     131To configure things for intranet crawling you must: 
     132 
     133    Create a flat file of root urls. For example, to crawl the nutch.org site you might start with a file named urls containing just the Nutch home page. All other Nutch pages should be reachable from this page. The urls file would thus look like: 
     134 
     135    http://www.nutch.org/ 
     136 
     137    Edit the file conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME with the name of the domain you wish to crawl. For example, if you wished to limit the crawl to the nutch.org domain, the line should read: 
     138 
     139    +^http://([a-z0-9]*\.)*nutch.org/ 
     140 
     141    This will include any url in the domain nutch.org. 
     142 
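     For the nutch 2.3.1 install used below, the equivalent setup would look something like this sketch, reusing the tutorial's nutch.org example (nutch 2.x appears to read conf/regex-urlfilter.txt rather than conf/crawl-urlfilter.txt):

         # seed list: a directory of flat files containing the root urls
         mkdir -p urls
         echo "http://www.nutch.org/" > urls/seed.txt

         # line added to conf/regex-urlfilter.txt to limit the crawl to the nutch.org domain
         +^http://([a-z0-9]*\.)*nutch.org/
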
     143Intranet: Running the Crawl 
     144Once things are configured, running the crawl is easy. Just use the crawl command. Its options include: 
     145 
     146    -dir dir names the directory to put the crawl in. 
     147    -depth depth indicates the link depth from the root page that should be crawled. 
     148    -delay delay determines the number of seconds between accesses to each host. 
     149    -threads threads determines the number of threads that will fetch in parallel. 
     150 
     151For example, a typical call might be: 
     152 
     153bin/nutch crawl urls -dir crawl.test -depth 3 >& crawl.log 
     154 
     155Typically one starts testing one's configuration by crawling at low depths, and watching the output to check that desired pages are found. Once one is more confident of the configuration, then an appropriate depth for a full crawl is around 10. <=========== 
     156 
     157Once crawling has completed, one can skip to the Searching section below. 
     158 
     159 
     160----------------------------------- 
     161Actually running nutch 2.x - steps 
     162----------------------------------- 
     163MANUALLY GOING THROUGH THE CYCLE 3 TIMES: 
     164 
     165cd ~/apache-nutch-2.3.1/runtime/local 
     166 
     167./bin/nutch inject urls 
     168 
     169./bin/nutch generate -topN 50 
     170./bin/nutch fetch -all 
     171./bin/nutch parse -all 
     172./bin/nutch updatedb -all 
     173 
     174./bin/nutch generate -topN 50 
     175./bin/nutch fetch -all 
     176./bin/nutch parse -all 
     177./bin/nutch updatedb -all 
     178 
     179./bin/nutch generate -topN 50 
     180./bin/nutch fetch -all 
     181./bin/nutch parse -all 
     182./bin/nutch updatedb -all 
     183 
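     The same three manual cycles can be written as a loop; a sketch equivalent to the commands above:

         cd ~/apache-nutch-2.3.1/runtime/local
         ./bin/nutch inject urls
         # repeat the generate-fetch-parse-updatedb cycle 3 times
         for i in 1 2 3; do
             ./bin/nutch generate -topN 50
             ./bin/nutch fetch -all
             ./bin/nutch parse -all
             ./bin/nutch updatedb -all
         done
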
     184Dump output on local filesystem: 
     185    rm -rf /tmp/bla 
     186    ./bin/nutch readdb -dump /tmp/bla 
     187    less /tmp/bla/part-r-00000 
     188 
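     The -text flag used for the hdfs dumps below should also work for a local dump if readable page text is wanted; an untested sketch (the /tmp/bla2 path is just a placeholder):

         rm -rf /tmp/bla2
         ./bin/nutch readdb -dump /tmp/bla2 -text
         less /tmp/bla2/part-r-00000
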
      189To dump output on the hadoop filesystem (hdfs): 
      190   The hdfs host name is needed when sending/dumping the nutch crawl output to a location on hdfs. 
      191   The host is defined in /usr/local/hadoop/etc/hadoop/core-site.xml in the property fs.defaultFS (https://stackoverflow.com/questions/27956973/java-io-ioexception-incomplete-hdfs-uri-no-host); 
      192   the host is hdfs://node2/ in this case. 
      193   So: 
     194 
     195    hdfs dfs -rmdir /user/vagrant/dump 
     196    XXX ./bin/nutch readdb -dump user/vagrant/dump -text ### won't work 
     197    XXX ./bin/nutch readdb -dump hdfs:///user/vagrant/dump -text ### won't work 
     198    ./bin/nutch readdb -dump hdfs://node2/user/vagrant/dump -text 
     199 
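     To look up the fs.defaultFS value without opening core-site.xml, something like this should also work (a sketch):

         hdfs getconf -confKey fs.defaultFS
         # should print the configured default filesystem for this setup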
     200 
     201USING THE SCRIPT TO ATTEMPT TO CRAWL A SITE 
     202* Choosing to repeat the cycle 10 times because, as per http://nutch.sourceforge.net/docs/en/tutorial.html 
     203 
     204"Typically one starts testing one's configuration by crawling at low depths, and watching the output to check that desired pages are found. Once one is more confident of the configuration, then an appropriate depth for a full crawl is around 10."  
     205 
      206* Use the ./bin/crawl script: provide the seed urls dir, the crawlId, and the number of times to repeat the cycle (10): 
     207vagrant@node2:~/apache-nutch-2.3.1/runtime/local$ ./bin/crawl urls davidbHomePage 10 
     208 
     209 
     210* View the downloaded crawls. 
      211This time the crawlId needs to be provided to readdb in order to get a dump of its text contents: 
     212   hdfs dfs -rm -r hdfs://node2/user/vagrant/dump2 
     213   ./bin/nutch readdb -dump hdfs://node2/user/vagrant/dump2 -text -crawlId davidbHomePage 
     214 
     215* View the contents: 
     216hdfs dfs -cat hdfs://node2/user/vagrant/dump2/part-r-* 
     217 
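     If a local copy of the dump is wanted (e.g. to page through with less), it can be pulled out of hdfs with -get; a sketch (the /tmp/dump2 path is just a placeholder):

         hdfs dfs -get hdfs://node2/user/vagrant/dump2 /tmp/dump2
         less /tmp/dump2/part-r-00000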
     218 
     219* FIND OUT NUMBER OF URLS DOWNLOADED FOR THE SITE: 
     220vagrant@node2:~/apache-nutch-2.3.1/runtime/local$ ./bin/nutch readdb -stats -crawlId davidbHomePage 
     221WebTable statistics start 
     222Statistics for WebTable:  
     223retry 0:    44 
     224status 5 (status_redir_perm):   4 
     225status 3 (status_gone): 1 
     226status 2 (status_fetched):  39 
     227jobs:   {[davidbHomePage]db_stats-job_local647846559_0001={jobName=[davidbHomePage]db_stats, jobID=job_local647846559_0001, counters={Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=135, REDUCE_INPUT_RECORDS=8, SPILLED_RECORDS=16, MERGED_MAP_OUTPUTS=1, VIRTUAL_MEMORY_BYTES=0, MAP_INPUT_RECORDS=44, SPLIT_RAW_BYTES=935, FAILED_SHUFFLE=0, MAP_OUTPUT_BYTES=2332, REDUCE_SHUFFLE_BYTES=135, PHYSICAL_MEMORY_BYTES=0, GC_TIME_MILLIS=0, REDUCE_INPUT_GROUPS=8, COMBINE_OUTPUT_RECORDS=8, SHUFFLED_MAPS=1, REDUCE_OUTPUT_RECORDS=8, MAP_OUTPUT_RECORDS=176, COMBINE_INPUT_RECORDS=176, CPU_MILLISECONDS=0, COMMITTED_HEAP_BYTES=595591168}, File Input Format Counters ={BYTES_READ=0}, File System Counters={FILE_LARGE_READ_OPS=0, FILE_WRITE_OPS=0, FILE_READ_OPS=0, FILE_BYTES_WRITTEN=1788140, FILE_BYTES_READ=1223290}, File Output Format Counters ={BYTES_WRITTEN=275}, Shuffle Errors={CONNECTION=0, WRONG_LENGTH=0, BAD_ID=0, WRONG_MAP=0, WRONG_REDUCE=0, IO_ERROR=0}}}} 
     228TOTAL urls: 44 
     229max score:  1.0 
     230avg score:  0.022727273 
     231min score:  0.0 
     232WebTable statistics: done 
     233 
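     To check just the URL count (the 44 referred to in the commit message), the stats output can be filtered; a sketch, assuming the stats lines end up on the console:

         ./bin/nutch readdb -stats -crawlId davidbHomePage 2>&1 | grep "TOTAL urls"
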
     234------------------------------------ 
     235STOPPING CONDITION 
      236Seems to be built in. 
      237* When I told it to cycle 15 times, it stopped after 6 cycles, saying there were no more URLs to fetch: 
     238 
     239vagrant@node2:~/apache-nutch-2.3.1/runtime/local$ ./bin/crawl urls davidbHomePage2 15 
     240--- 
     241No SOLRURL specified. Skipping indexing. 
     242Injecting seed URLs 
     243 
     244... 
     245 
     246Thu Oct 3 09:22:23 UTC 2019 : Iteration 6 of 15 
     247Generating batchId 
     248Generating a new fetchlist 
     249... 
     250Generating batchId 
     251Generating a new fetchlist 
     252/home/vagrant/apache-nutch-2.3.1/runtime/local/bin/nutch generate -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -topN 50000 -noNorm -noFilter -adddays 0 -crawlId davidbHomePage2 -batchId 1570094569-27637 
     253GeneratorJob: starting at 2019-10-03 09:22:49 
     254GeneratorJob: Selecting best-scoring urls due for fetch. 
     255GeneratorJob: starting 
     256GeneratorJob: filtering: false 
     257GeneratorJob: normalizing: false 
     258GeneratorJob: topN: 50000 
     259GeneratorJob: finished at 2019-10-03 09:22:52, time elapsed: 00:00:02 
     260GeneratorJob: generated batch id: 1570094569-27637 containing 0 URLs 
     261Generate returned 1 (no new segments created) 
     262Escaping loop: no more URLs to fetch now 
     263vagrant@node2:~/apache-nutch-2.3.1/runtime/local$  
     264--- 
     265 
      266* Running readdb -stats shows 44 URLs fetched, just as the first time (when the crawlId had been "davidbHomePage"): 
     267 
     268vagrant@node2:~/apache-nutch-2.3.1/runtime/local$ ./bin/nutch readdb -stats -crawlId davidbHomePage2 
     269--- 
     270WebTable statistics start 
     271Statistics for WebTable:  
     272retry 0:    44 
     273status 5 (status_redir_perm):   4 
     274status 3 (status_gone): 1 
     275status 2 (status_fetched):  39 
     276jobs:   {[davidbHomePage2]db_stats-job_local985519583_0001={jobName=[davidbHomePage2]db_stats, jobID=job_local985519583_0001, counters={Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=135, REDUCE_INPUT_RECORDS=8, SPILLED_RECORDS=16, MERGED_MAP_OUTPUTS=1, VIRTUAL_MEMORY_BYTES=0, MAP_INPUT_RECORDS=44, SPLIT_RAW_BYTES=935, FAILED_SHUFFLE=0, MAP_OUTPUT_BYTES=2332, REDUCE_SHUFFLE_BYTES=135, PHYSICAL_MEMORY_BYTES=0, GC_TIME_MILLIS=4, REDUCE_INPUT_GROUPS=8, COMBINE_OUTPUT_RECORDS=8, SHUFFLED_MAPS=1, REDUCE_OUTPUT_RECORDS=8, MAP_OUTPUT_RECORDS=176, COMBINE_INPUT_RECORDS=176, CPU_MILLISECONDS=0, COMMITTED_HEAP_BYTES=552599552}, File Input Format Counters ={BYTES_READ=0}, File System Counters={FILE_LARGE_READ_OPS=0, FILE_WRITE_OPS=0, FILE_READ_OPS=0, FILE_BYTES_WRITTEN=1788152, FILE_BYTES_READ=1223290}, File Output Format Counters ={BYTES_WRITTEN=275}, Shuffle Errors={CONNECTION=0, WRONG_LENGTH=0, BAD_ID=0, WRONG_MAP=0, WRONG_REDUCE=0, IO_ERROR=0}}}} 
     277TOTAL urls: 44 
     278--- 
     279 
  • gs3-extensions/maori-lang-detection/hdfs-cc-work/GS_README.TXT

    r33543 r33545  
    4334334. Since trying to go install the crawl url didn't work 
    434434https://stackoverflow.com/questions/14416275/error-cant-load-package-package-my-prog-found-packages-my-prog-and-main 
     435[https://stackoverflow.com/questions/26694271/go-install-doesnt-create-any-bin-file] 
    435436 
    436437vagrant@node2:~/go/src$