Changeset 33440

Timestamp:
28.08.2019 19:17:42
Author:
ak19
Message:

Split file to move vagrant-spark-hadoop notes into own file.

Location:
gs3-extensions/maori-lang-detection/MoreReading
Files:
1 added
1 modified

  • gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt

    r33428 r33440  
To run firefox (or anything graphical) inside the VM managed by vagrant, you have to ssh -Y twice: first onto analytics, and then from analytics into the vagrant VM:
1. ssh -Y analytics
2. [anupama@analytics vagrant-hadoop-hive-spark]$ vagrant ssh -- -Y
or
vagrant ssh -- -Y node1
(the -- flag tells the vagrant command that the subsequent -Y flag should be passed on to the ssh command that vagrant runs)

Only once you have ssh-ed with vagrant into the VM whose hostname is "node1" do you have access to node1's assigned IP, 10.211.55.101:
- Machines connecting from outside, like analytics, must either reach node1 directly or use port forwarding to view the VM's servers on localhost. For example, on analytics you can view the Yarn pages at http://localhost:8088/
- If firefox is launched inside the VM (so inside node1), pages can be accessed off their respective ports at any of localhost, 10.211.55.101 or node1.
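The -- separator used in "vagrant ssh -- -Y" is a general command-line convention, not something vagrant-specific: everything after -- is passed through rather than parsed as an option. A minimal illustration of the same convention with grep (toy input, nothing to do with vagrant itself):

```shell
# `--` ends option parsing, so grep treats -Y as the search pattern
# instead of trying to interpret it as one of grep's own flags.
printf '%s\n' '-Y flag' | grep -- -Y
# without --, grep would reject -Y as an unknown option
```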


WET example from https://github.com/commoncrawl/cc-warc-examples

vagrant@node1:~/cc-warc-examples$ hdfs dfs -mkdir /user/vagrant/data
vagrant@node1:~/cc-warc-examples$ hdfs dfs -put data/CC-MAIN-20190715175205-20190715200159-00000.warc.wet.gz hdfs:///user/vagrant/data/.
vagrant@node1:~/cc-warc-examples$ hdfs dfs -ls data
Found 1 items
-rw-r--r--   1 vagrant supergroup  154823265 2019-08-19 08:23 data/CC-MAIN-20190715175205-20190715200159-00000.warc.wet.gz
vagrant@node1:~/cc-warc-examples$ hadoop jar target/cc-warc-examples-0.3-SNAPSHOT-jar-with-dependencies.jar org.commoncrawl.examples.mapreduce.WETWordCount

<ONCE FINISHED:>

vagrant@node1:~/cc-warc-examples$ hdfs dfs -cat /tmp/cc/part*
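Conceptually, the WETWordCount job above is the distributed version of the classic map (split into words), shuffle (sort) and reduce (count) pipeline. A local sketch of the same idea on a toy file (not the actual MapReduce job, and /tmp/sample.txt is just a stand-in):

```shell
# Word count the pre-Hadoop way: split into one word per line, sort so
# identical words are adjacent, then count runs with uniq -c.
printf 'the cat sat on the mat\n' > /tmp/sample.txt
tr -s '[:space:]' '\n' < /tmp/sample.txt | sort | uniq -c | sort -rn
```

The part* files that `hdfs dfs -cat` prints are just the per-reducer equivalents of this pipeline's output.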


INFO ON HADOOP/HDFS:
https://www.bluedata.com/blog/2016/08/hadoop-hdfs-upgrades-painful/

---------------
More examples to try:
https://github.com/commoncrawl/cc-warc-examples


A bit outdated?
https://www.journaldev.com/20342/apache-spark-example-word-count-program-java
https://www.journaldev.com/20261/apache-spark

--------
sudo apt-get install maven
(or: sudo apt update
sudo apt install maven)
git clone https://github.com/commoncrawl/cc-index-table.git
cd cc-index-table
mvn package
vagrant@node1:~/cc-index-table$ ./src/script/convert_url_index.sh https://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2019-30/indexes/cdx-00000.gz hdfs:///user/vagrant/cc-index-table
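The cdx-*.gz index files that convert_url_index.sh consumes are gzipped plain text, so they can be inspected with zcat/head before kicking off the conversion. A sketch on a stand-in file (/tmp/demo.gz is hypothetical, not the real index):

```shell
# Peek at the first record of a gzipped text file without extracting it.
printf 'example index line\n' | gzip > /tmp/demo.gz
zcat /tmp/demo.gz | head -1
```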


spark:
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-shell.html
============
Dr Bainbridge found the following vagrant file that will set up hadoop and spark, presumably for cluster computing:

https://github.com/martinprobson/vagrant-hadoop-hive-spark

Vagrant:
    * Guide: https://www.vagrantup.com/intro/getting-started/index.html
    * Common cmds: https://blog.ipswitch.com/5-vagrant-commands-you-need-to-know
    * vagrant reload = vagrant halt + vagrant up https://www.vagrantup.com/docs/cli/reload.html
    * https://stackoverflow.com/questions/46903623/how-to-use-firefox-ui-in-vagrant-box
    * https://stackoverflow.com/questions/22651399/how-to-install-firefox-in-precise64-vagrant-box
      sudo apt-get -y install firefox
    * vagrant install emacs: https://medium.com/@AnnaJS15/getting-started-with-virtualbox-and-vagrant-8d98aa271d2a

    * hadoop conf: sudo vi /usr/local/hadoop-2.7.6/etc/hadoop/mapred-site.xml
    * https://data-flair.training/forums/topic/mkdir-cannot-create-directory-data-name-node-is-in-safe-mode/
---
==> node1: Forwarding ports...
    node1: 8080 (guest) => 8081 (host) (adapter 1)
    node1: 8088 (guest) => 8089 (host) (adapter 1)
    node1: 9083 (guest) => 9084 (host) (adapter 1)
    node1: 4040 (guest) => 4041 (host) (adapter 1)
    node1: 18888 (guest) => 18889 (host) (adapter 1)
    node1: 16010 (guest) => 16011 (host) (adapter 1)
    node1: 22 (guest) => 2200 (host) (adapter 1)
==> node1: Running 'pre-boot' VM customizations...


==> node1: Checking for guest additions in VM...
    node1: The guest additions on this VM do not match the installed version of
    node1: VirtualBox! In most cases this is fine, but in rare cases it can
    node1: prevent things such as shared folders from working properly. If you see
    node1: shared folder errors, please make sure the guest additions within the
    node1: virtual machine match the version of VirtualBox you have installed on
    node1: your host and reload your VM.
    node1:
    node1: Guest Additions Version: 5.1.38
    node1: VirtualBox Version: 5.2
------------

At http://commoncrawl.org/2018/10/september-2018-crawl-archive-now-available/, it says