Changeset 33425


Timestamp:
2019-08-16T22:15:40+12:00
Author:
ak19
Message:

A few more links now that I got past getting the vagrant VM with spark and hadoop working.

File:
1 edited

  • gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt

    r33423 r33425  
sudo apt-get install maven
(or sudo apt update
sudo apt install maven)
git clone https://github.com/commoncrawl/cc-index-table.git
cd cc-index-table
mvn package
vagrant@node1:~/cc-index-table$ ./src/script/convert_url_index.sh https://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2019-30/indexes/cdx-00000.gz hdfs:///user/vagrant/cc-index-table
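
The index URL passed to convert_url_index.sh above is made up of a crawl ID and a zero-padded shard number. A minimal sketch of a helper that builds such a URL, assuming other shards follow the same cdx-NNNNN.gz naming as cdx-00000.gz; the function name cdx_url is hypothetical and not part of cc-index-table:

```shell
#!/bin/sh
# Hypothetical helper: build a Common Crawl cdx index URL from a crawl ID
# and a shard number. Assumes shards are named cdx-NNNNN.gz, as cdx-00000.gz is.
cdx_url() {
    crawl="$1"   # e.g. CC-MAIN-2019-30
    shard="$2"   # plain decimal integer without leading zeros, e.g. 0
    printf 'https://commoncrawl.s3.amazonaws.com/cc-index/collections/%s/indexes/cdx-%05d.gz\n' "$crawl" "$shard"
}

# Prints the URL used in the command above.
cdx_url CC-MAIN-2019-30 0
```

The printed URL could then be given as the first argument to ./src/script/convert_url_index.sh, with the HDFS output path as the second.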




spark:
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-shell.html

============
Dr Bainbridge found the following vagrant file that will set up hadoop and spark presumably for cluster computing:

     
    * Guide: https://www.vagrantup.com/intro/getting-started/index.html
    * Common cmds: https://blog.ipswitch.com/5-vagrant-commands-you-need-to-know
    * vagrant reload = vagrant halt + vagrant up https://www.vagrantup.com/docs/cli/reload.html
    * https://stackoverflow.com/questions/46903623/how-to-use-firefox-ui-in-vagrant-box
    * https://stackoverflow.com/questions/22651399/how-to-install-firefox-in-precise64-vagrant-box
      sudo apt-get -y install firefox
    * vagrant install emacs: https://medium.com/@AnnaJS15/getting-started-with-virtualbox-and-vagrant-8d98aa271d2a

    * hadoop conf: sudo vi /usr/local/hadoop-2.7.6/etc/hadoop/mapred-site.xml
    * https://data-flair.training/forums/topic/mkdir-cannot-create-directory-data-name-node-is-in-safe-mode/
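
The last link above is about mkdir failing because the name node is in safe mode. A minimal sketch of checking for that state and leaving it, using `hdfs dfsadmin -safemode get` and `hdfs dfsadmin -safemode leave`. The function name is hypothetical. Forcing the name node out of safe mode is assumed to be acceptable only on a throwaway VM like this one, and the commands may need to be run as the HDFS superuser:

```shell
#!/bin/sh
# Hypothetical helper: leave HDFS safe mode only if the name node reports it is on.
leave_safe_mode_if_on() {
    # "hdfs dfsadmin -safemode get" reports "Safe mode is ON" or "Safe mode is OFF".
    status=$(hdfs dfsadmin -safemode get)
    case "$status" in
        *"Safe mode is ON"*)
            hdfs dfsadmin -safemode leave
            ;;
        *)
            echo "safe mode is not on; nothing to do"
            ;;
    esac
}
```

On node1 this would be called as `leave_safe_mode_if_on` before retrying the failed mkdir.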
---
==> node1: Forwarding ports...
    node1: 8080 (guest) => 8081 (host) (adapter 1)
    node1: 8088 (guest) => 8089 (host) (adapter 1)
    node1: 9083 (guest) => 9084 (host) (adapter 1)
    node1: 4040 (guest) => 4041 (host) (adapter 1)
    node1: 18888 (guest) => 18889 (host) (adapter 1)
    node1: 16010 (guest) => 16011 (host) (adapter 1)
    node1: 22 (guest) => 2200 (host) (adapter 1)
==> node1: Running 'pre-boot' VM customizations...


==> node1: Checking for guest additions in VM...
    node1: The guest additions on this VM do not match the installed version of
    node1: VirtualBox! In most cases this is fine, but in rare cases it can
    node1: prevent things such as shared folders from working properly. If you see
    node1: shared folder errors, please make sure the guest additions within the
    node1: virtual machine match the version of VirtualBox you have installed on
    node1: your host and reload your VM.
    node1:
    node1: Guest Additions Version: 5.1.38
    node1: VirtualBox Version: 5.2
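
The "Forwarding ports" lines above show which host port each guest port is mapped to. A small sketch that looks up the host port for a given guest port from that kind of output, read on stdin. The function name host_port_for is hypothetical, and the parsing assumes the exact line layout shown above:

```shell
#!/bin/sh
# Hypothetical helper: read "vagrant up" output on stdin and print the host
# port that the given guest port is forwarded to (prints nothing if none).
host_port_for() {
    awk -v guest="$1" '$3 == "(guest)" && $4 == "=>" && $2 == guest { print $5 }'
}

# Example using two of the forwarding lines shown above; prints 8089.
host_port_for 8088 <<'EOF'
    node1: 8080 (guest) => 8081 (host) (adapter 1)
    node1: 8088 (guest) => 8089 (host) (adapter 1)
EOF
```

With the mapping above, a service listening on guest port 8088 is reached from the host on port 8089.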

------------

     
SPARK (Spark SQL): https://github.com/commoncrawl/cc-index-table
    with example on selecting languages
https://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2019-30/indexes/cluster.idx

./convert_url_index.sh https://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2019-30/indexes/cdx-00000.gz hdfs:///user/vagrant/cc-index-table
---
