Timestamp: 2019-08-28T19:17:42+12:00
Author: ak19
Message: Split file to move vagrant-spark-hadoop notes into own file.
File: gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt (r33428 → r33440)

To run firefox/anything graphical inside the VM run by vagrant, you have to ssh -Y onto both analytics and then onto the vagrant VM from analytics:
1. ssh analytics -Y
2. [anupama@analytics vagrant-hadoop-hive-spark]$ vagrant ssh -- -Y
or
vagrant ssh -- -Y node1
(the -- flag tells the vagrant command that the subsequent -Y flag should be passed to the ssh cmd that vagrant runs)

Only once ssh-ed with vagrant into the VM whose hostname is "node1" do you have access to node1's assigned IP: 10.211.55.101
- Machines connecting to node1, like analytics, must access it via that IP, or use port forwarding to view the VM's servers on localhost. For example, on analytics you can view the Yarn pages at http://localhost:8088/
- If firefox is launched inside the VM (so inside node1), then pages can be accessed off their respective ports at any of localhost|10.211.55.101|node1.



WET example from https://github.com/commoncrawl/cc-warc-examples

vagrant@node1:~/cc-warc-examples$ hdfs dfs -mkdir /user/vagrant/data
vagrant@node1:~/cc-warc-examples$ hdfs dfs -put data/CC-MAIN-20190715175205-20190715200159-00000.warc.wet.gz hdfs:///user/vagrant/data/.
vagrant@node1:~/cc-warc-examples$ hdfs dfs -ls data
Found 1 items
-rw-r--r--   1 vagrant supergroup  154823265 2019-08-19 08:23 data/CC-MAIN-20190715175205-20190715200159-00000.warc.wet.gz
vagrant@node1:~/cc-warc-examples$ hadoop jar target/cc-warc-examples-0.3-SNAPSHOT-jar-with-dependencies.jar org.commoncrawl.examples.mapreduce.WETWordCount

<ONCE FINISHED:>

vagrant@node1:~/cc-warc-examples$ hdfs dfs -cat /tmp/cc/part*

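The gist of the WETWordCount job above can be sketched outside Hadoop as follows. This is a guess at the logic (splitting a WET file into records on its WARC/1.0 headers and tokenising the plain-text bodies on whitespace), not the actual org.commoncrawl.examples.mapreduce.WETWordCount code; SAMPLE_WET is a made-up two-record file for illustration.

```python
from collections import Counter

# A tiny stand-in for a WET file: WET is WARC containing one "conversion"
# record of extracted plain text per crawled page, each preceded by headers.
SAMPLE_WET = """WARC/1.0
WARC-Type: conversion
WARC-Target-URI: http://example.org/
Content-Length: 11

hello world

WARC/1.0
WARC-Type: conversion
WARC-Target-URI: http://example.org/2
Content-Length: 11

hello again
"""

def wet_word_count(wet_text):
    """Count words across the plain-text bodies of all WET records."""
    counts = Counter()
    for record in wet_text.split("WARC/1.0"):
        if not record.strip():
            continue
        # Headers and body are separated by a blank line.
        parts = record.split("\n\n", 1)
        if len(parts) < 2:
            continue
        counts.update(parts[1].split())
    return counts

counts = wet_word_count(SAMPLE_WET)
print(counts["hello"])  # 2
```

On the cluster the same counting is done as a MapReduce job over the gzipped WET file in HDFS, with the per-reducer results landing in /tmp/cc/part* as catted above.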
INFO ON HADOOP/HDFS:
https://www.bluedata.com/blog/2016/08/hadoop-hdfs-upgrades-painful/

---------------
More examples to try:
https://github.com/commoncrawl/cc-warc-examples

A bit outdated?
https://www.journaldev.com/20342/apache-spark-example-word-count-program-java
https://www.journaldev.com/20261/apache-spark

--------
sudo apt-get install maven
(or: sudo apt update
sudo apt install maven)
git clone https://github.com/commoncrawl/cc-index-table.git
cd cc-index-table
mvn package
vagrant@node1:~/cc-index-table$ ./src/script/convert_url_index.sh https://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2019-30/indexes/cdx-00000.gz hdfs:///user/vagrant/cc-index-table

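For reference, each line of the cdx-00000.gz url index that convert_url_index.sh converts into a table is a SURT-sorted key, a timestamp, and a JSON blob, roughly as sketched here. The sample line and parser are illustrative assumptions about the public cc-index format, not code from the cc-index-table project.

```python
import json

# One line in the style of a Common Crawl CDX url index
# (format: SURT key, capture timestamp, JSON metadata).
# This sample line is constructed for illustration, not copied from cdx-00000.gz.
CDX_LINE = ('org,example)/ 20190715175205 '
            '{"url": "http://example.org/", "mime": "text/html", '
            '"status": "200", "digest": "AAAA", "length": "1234"}')

def parse_cdx_line(line):
    """Split a CDX line into (surt_key, timestamp, fields-dict)."""
    surt, timestamp, json_blob = line.split(" ", 2)
    return surt, timestamp, json.loads(json_blob)

surt, ts, fields = parse_cdx_line(CDX_LINE)
print(surt, ts, fields["status"])
```

The conversion script effectively flattens these three parts (and the JSON keys) into columns, which is what makes the index queryable as a table afterwards.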
spark:
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-shell.html

============
Dr Bainbridge found the following Vagrantfile that will set up hadoop and spark, presumably for cluster computing:

https://github.com/martinprobson/vagrant-hadoop-hive-spark

Vagrant:
    * Guide: https://www.vagrantup.com/intro/getting-started/index.html
    * Common cmds: https://blog.ipswitch.com/5-vagrant-commands-you-need-to-know
    * vagrant reload = vagrant halt + vagrant up https://www.vagrantup.com/docs/cli/reload.html
    * https://stackoverflow.com/questions/46903623/how-to-use-firefox-ui-in-vagrant-box
    * https://stackoverflow.com/questions/22651399/how-to-install-firefox-in-precise64-vagrant-box
      sudo apt-get -y install firefox
    * vagrant install emacs: https://medium.com/@AnnaJS15/getting-started-with-virtualbox-and-vagrant-8d98aa271d2a

    * hadoop conf: sudo vi /usr/local/hadoop-2.7.6/etc/hadoop/mapred-site.xml
    * https://data-flair.training/forums/topic/mkdir-cannot-create-directory-data-name-node-is-in-safe-mode/
---
    75 ==> node1: Forwarding ports...
    76     node1: 8080 (guest) => 8081 (host) (adapter 1)
    77     node1: 8088 (guest) => 8089 (host) (adapter 1)
    78     node1: 9083 (guest) => 9084 (host) (adapter 1)
    79     node1: 4040 (guest) => 4041 (host) (adapter 1)
    80     node1: 18888 (guest) => 18889 (host) (adapter 1)
    81     node1: 16010 (guest) => 16011 (host) (adapter 1)
    82     node1: 22 (guest) => 2200 (host) (adapter 1)
    83 ==> node1: Running 'pre-boot' VM customizations...
    84 
    85 
    86 ==> node1: Checking for guest additions in VM...
    87     node1: The guest additions on this VM do not match the installed version of
    88     node1: VirtualBox! In most cases this is fine, but in rare cases it can
    89     node1: prevent things such as shared folders from working properly. If you see
    90     node1: shared folder errors, please make sure the guest additions within the
    91     node1: virtual machine match the version of VirtualBox you have installed on
    92     node1: your host and reload your VM.
    93     node1:
    94     node1: Guest Additions Version: 5.1.38
    95     node1: VirtualBox Version: 5.2
    96 
    97 ------------
At http://commoncrawl.org/2018/10/september-2018-crawl-archive-now-available/, it says