Timestamp: 2019-08-28T19:17:42+12:00
File: 1 edited
gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt
r33428 r33440

To run firefox/anything graphical inside the VM run by vagrant, have to ssh -Y onto both analytics and then to the vagrant VM from analytics:
1. ssh analytics -Y
2. [anupama@analytics vagrant-hadoop-hive-spark]$ vagrant ssh -- -Y
or
vagrant ssh -- -Y node1
(the -- flag tells the vagrant command that the subsequent -Y flag should be passed to the ssh cmd that vagrant runs)

Only once ssh-ed with vagrant into the VM whose hostname is "node1", do you have access to node1's assigned IP: 10.211.55.101
- Connecting machines, like analytics, must access node1 or use port forwarding to view the VM's servers on localhost. For example, on analytics, can view Yarn pages at http://localhost:8088/
- If firefox is launched inside the VM (so inside node1), then can access pages off their respective ports at any of localhost|10.211.55.101|node1.




WET example from https://github.com/commoncrawl/cc-warc-examples

vagrant@node1:~/cc-warc-examples$ hdfs dfs -mkdir /user/vagrant/data
vagrant@node1:~/cc-warc-examples$ hdfs dfs -put data/CC-MAIN-20190715175205-20190715200159-00000.warc.wet.gz hdfs:///user/vagrant/data/.
vagrant@node1:~/cc-warc-examples$ hdfs dfs -ls data
Found 1 items
-rw-r--r-- 1 vagrant supergroup 154823265 2019-08-19 08:23 data/CC-MAIN-20190715175205-20190715200159-00000.warc.wet.gz
vagrant@node1:~/cc-warc-examples$ hadoop jar target/cc-warc-examples-0.3-SNAPSHOT-jar-with-dependencies.jar org.commoncrawl.examples.mapreduce.WETWordCount

<ONCE FINISHED:>

vagrant@node1:~/cc-warc-examples$ hdfs dfs -cat /tmp/cc/part*



INFO ON HADOOP/HDFS:
https://www.bluedata.com/blog/2016/08/hadoop-hdfs-upgrades-painful/

---------------
More examples to try:
https://github.com/commoncrawl/cc-warc-examples


A bit outdated?
https://www.journaldev.com/20342/apache-spark-example-word-count-program-java
https://www.journaldev.com/20261/apache-spark

--------

sudo apt-get install maven
(or sudo apt update
sudo apt install maven)
git clone https://github.com/commoncrawl/cc-index-table.git
cd cc-index-table
mvn package
vagrant@node1:~/cc-index-table$ ./src/script/convert_url_index.sh https://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2019-30/indexes/cdx-00000.gz hdfs:///user/vagrant/cc-index-table




spark:
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-shell.html

============
Dr Bainbridge found the following vagrant file that will set up hadoop and spark presumably for cluster computing:

https://github.com/martinprobson/vagrant-hadoop-hive-spark

Vagrant:
* Guide: https://www.vagrantup.com/intro/getting-started/index.html
* Common cmds: https://blog.ipswitch.com/5-vagrant-commands-you-need-to-know
* vagrant reload = vagrant halt + vagrant up https://www.vagrantup.com/docs/cli/reload.html
* https://stackoverflow.com/questions/46903623/how-to-use-firefox-ui-in-vagrant-box
* https://stackoverflow.com/questions/22651399/how-to-install-firefox-in-precise64-vagrant-box
sudo apt-get -y install firefox
* vagrant install emacs: https://medium.com/@AnnaJS15/getting-started-with-virtualbox-and-vagrant-8d98aa271d2a

* hadoop conf: sudo vi /usr/local/hadoop-2.7.6/etc/hadoop/mapred-site.xml
* https://data-flair.training/forums/topic/mkdir-cannot-create-directory-data-name-node-is-in-safe-mode/
---
==> node1: Forwarding ports...
node1: 8080 (guest) => 8081 (host) (adapter 1)
node1: 8088 (guest) => 8089 (host) (adapter 1)
node1: 9083 (guest) => 9084 (host) (adapter 1)
node1: 4040 (guest) => 4041 (host) (adapter 1)
node1: 18888 (guest) => 18889 (host) (adapter 1)
node1: 16010 (guest) => 16011 (host) (adapter 1)
node1: 22 (guest) => 2200 (host) (adapter 1)
==> node1: Running 'pre-boot' VM customizations...


==> node1: Checking for guest additions in VM...
node1: The guest additions on this VM do not match the installed version of
node1: VirtualBox! In most cases this is fine, but in rare cases it can
node1: prevent things such as shared folders from working properly. If you see
node1: shared folder errors, please make sure the guest additions within the
node1: virtual machine match the version of VirtualBox you have installed on
node1: your host and reload your VM.
node1:
node1: Guest Additions Version: 5.1.38
node1: VirtualBox Version: 5.2

------------

At http://commoncrawl.org/2018/10/september-2018-crawl-archive-now-available/, it says