Changeset 33425

Timestamp: 16.08.2019 22:15:40
Author: ak19
Message: A few more links now that I got past getting the vagrant VM with spark and hadoop working.
Files: 1 modified

  • gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt

    r33423 r33425  
+sudo apt-get install maven
+(or sudo apt update
+sudo apt install maven)
+git clone https://github.com/commoncrawl/cc-index-table.git
+cd cc-index-table
+mvn package
+vagrant@node1:~/cc-index-table$ ./src/script/convert_url_index.sh https://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2019-30/indexes/cdx-00000.gz hdfs:///user/vagrant/cc-index-table
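The convert step above names only the part file cdx-00000.gz. A minimal sketch for producing the same command for other part files, assuming they follow the same five-digit naming under the same URL prefix. The function names cdx_url and convert_cmd are hypothetical and are not part of cc-index-table. The command is printed rather than run, so it can be checked first.

```shell
# Build the URL of one cdx part file of a crawl.
# The URL layout is copied from the notes; other part numbers are an assumption.
cdx_url() {
    # $1 = crawl ID (e.g. CC-MAIN-2019-30), $2 = part number as plain decimal (e.g. 0)
    printf 'https://commoncrawl.s3.amazonaws.com/cc-index/collections/%s/indexes/cdx-%05d.gz\n' "$1" "$2"
}

# Print (do not execute) the conversion command for one part file.
convert_cmd() {
    # $3 = optional target location; defaults to the HDFS path used in the notes
    printf './src/script/convert_url_index.sh %s %s\n' \
        "$(cdx_url "$1" "$2")" "${3:-hdfs:///user/vagrant/cc-index-table}"
}
```

Usage, from inside the cc-index-table checkout: `convert_cmd CC-MAIN-2019-30 0` prints the command recorded above.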
+
+
+
+
+spark:
+https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-shell.html
+
+============
 Dr Bainbridge found the following vagrant file that will set up hadoop and spark presumably for cluster computing:
 
     
     * Guide: https://www.vagrantup.com/intro/getting-started/index.html
     * Common cmds: https://blog.ipswitch.com/5-vagrant-commands-you-need-to-know
+    * vagrant reload = vagrant halt + vagrant up https://www.vagrantup.com/docs/cli/reload.html
+    * https://stackoverflow.com/questions/46903623/how-to-use-firefox-ui-in-vagrant-box
+    * https://stackoverflow.com/questions/22651399/how-to-install-firefox-in-precise64-vagrant-box
+      sudo apt-get -y install firefox
+    * vagrant install emacs: https://medium.com/@AnnaJS15/getting-started-with-virtualbox-and-vagrant-8d98aa271d2a
+
+    * hadoop conf: sudo vi /usr/local/hadoop-2.7.6/etc/hadoop/mapred-site.xml
+    * https://data-flair.training/forums/topic/mkdir-cannot-create-directory-data-name-node-is-in-safe-mode/
+---
+==> node1: Forwarding ports...
+    node1: 8080 (guest) => 8081 (host) (adapter 1)
+    node1: 8088 (guest) => 8089 (host) (adapter 1)
+    node1: 9083 (guest) => 9084 (host) (adapter 1)
+    node1: 4040 (guest) => 4041 (host) (adapter 1)
+    node1: 18888 (guest) => 18889 (host) (adapter 1)
+    node1: 16010 (guest) => 16011 (host) (adapter 1)
+    node1: 22 (guest) => 2200 (host) (adapter 1)
+==> node1: Running 'pre-boot' VM customizations...
+
+
+==> node1: Checking for guest additions in VM...
+    node1: The guest additions on this VM do not match the installed version of
+    node1: VirtualBox! In most cases this is fine, but in rare cases it can
+    node1: prevent things such as shared folders from working properly. If you see
+    node1: shared folder errors, please make sure the guest additions within the
+    node1: virtual machine match the version of VirtualBox you have installed on
+    node1: your host and reload your VM.
+    node1:
+    node1: Guest Additions Version: 5.1.38
+    node1: VirtualBox Version: 5.2
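The forwarded-port lines in the log above say which host port reaches each guest port: guest 8080 is reachable on the host as localhost:8081, and so on. A small sketch that extracts this mapping from saved `vagrant up` output, assuming the line layout shown in the log. The function name forwarded_ports is hypothetical.

```shell
# Read vagrant output on stdin and print one "guest-port localhost:host-port"
# pair per forwarded port. Lines that are not port mappings are ignored.
forwarded_ports() {
    sed -n 's|^.*: \([0-9][0-9]*\) (guest) => \([0-9][0-9]*\) (host).*$|\1 localhost:\2|p'
}
```

Usage: `vagrant up 2>&1 | tee up.log`, then `forwarded_ports < up.log`. Which service sits behind each guest port depends on the vagrant file and is not stated in the notes.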
+
 ------------
 
     
 SPARK (Spark SQL): https://github.com/commoncrawl/cc-index-table
     with example on selecting languages
-
+https://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2019-30/indexes/cluster.idx
+
+./convert_url_index.sh https://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2019-30/indexes/cdx-00000.gz hdfs:///user/vagrant/cc-index-table
 ---
 
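The cc-index-table link above is noted for its example on selecting languages. A hedged sketch of what such a Spark SQL query might look like for this project, assuming the converted table is registered as ccindex and has the columns url, crawl, subset and content_languages as described in the cc-index-table README. The function name language_query is hypothetical. The code mri is the ISO 639-3 code for Maori.

```shell
# Print a Spark SQL query selecting the URLs of pages in one language.
# Table and column names are assumptions taken from the cc-index-table README.
language_query() {
    # $1 = crawl ID (e.g. CC-MAIN-2019-30), $2 = ISO 639-3 language code (e.g. mri)
    cat <<EOF
SELECT url, content_languages
FROM ccindex
WHERE crawl = '$1'
  AND subset = 'warc'
  AND content_languages = '$2'
EOF
}
```

Usage: `language_query CC-MAIN-2019-30 mri` prints the query text only. The equality test matches pages where the given language is the only one detected; pages with several detected languages would likely need a LIKE pattern instead. Running the query against the table in HDFS is a further step the notes do not cover.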