Ignore:
Timestamp:
2019-10-03T22:38:00+13:00 (5 years ago)
Author:
ak19
Message:

Mainly changes to crawling-Nutch.txt and some minor changes to other txt files. crawling-Nutch.txt now documents my attempts to successfully run nutch v2 on the davidb homepage site and crawl it entirely and dump the text output into the local or hadoop filesystem. I also ran 2 different numbers of nutch cycles (generate-fetch-parse-updatedb) to download the site: 10 cycles and 15 cycles. I paid attention to the output the second time, it stopped after 6 cycles saying there was nothing new to fetch. So it seems to have a built-in termination test, allowing site mirroring. Running readdb with the -stats flag allowed me to check that both times, it downloaded 44 URLs.

File:
1 edited

Legend:

Unmodified
Added
Removed
  • gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt

    r33499 r33545  
    33https://www.guru99.com/create-your-first-hadoop-program.html
    44
     5Some Hadoop commands
     6* https://community.cloudera.com/t5/Support-Questions/Closed-How-to-store-output-of-shell-script-in-HDFS/td-p/229933
     7* https://stackoverflow.com/questions/26513861/checking-if-directory-in-hdfs-already-exists-or-not
    58--------------
    69To run firefox/anything graphical inside the VM run by vagrant, have to ssh -Y onto both analytics and then to the vagrant VM from analytics:
Note: See TracChangeset for help on using the changeset viewer.