- Timestamp: 2019-08-16T22:15:40+12:00
- File: 1 edited
--- gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt (r33423)
+++ gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt (r33425)
@@ -1,2 +1,17 @@
+sudo apt-get install maven
+(or sudo apt update
+sudo apt install maven)
+git clone https://github.com/commoncrawl/cc-index-table.git
+cd cc-index-table
+mvn package
+vagrant@node1:~/cc-index-table$ ./src/script/convert_url_index.sh https://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2019-30/indexes/cdx-00000.gz hdfs:///user/vagrant/cc-index-table
+
+
+
+
+spark:
+https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-shell.html
+
+============
 Dr Bainbridge found the following vagrant file that will set up hadoop and spark presumably for cluster computing:
 
@@ -6,4 +21,35 @@
 * Guide: https://www.vagrantup.com/intro/getting-started/index.html
 * Common cmds: https://blog.ipswitch.com/5-vagrant-commands-you-need-to-know
+* vagrant reload = vagrant halt + vagrant up https://www.vagrantup.com/docs/cli/reload.html
+* https://stackoverflow.com/questions/46903623/how-to-use-firefox-ui-in-vagrant-box
+* https://stackoverflow.com/questions/22651399/how-to-install-firefox-in-precise64-vagrant-box
+sudo apt-get -y install firefox
+* vagrant install emacs: https://medium.com/@AnnaJS15/getting-started-with-virtualbox-and-vagrant-8d98aa271d2a
+
+* hadoop conf: sudo vi /usr/local/hadoop-2.7.6/etc/hadoop/mapred-site.xml
+* https://data-flair.training/forums/topic/mkdir-cannot-create-directory-data-name-node-is-in-safe-mode/
+---
+==> node1: Forwarding ports...
+node1: 8080 (guest) => 8081 (host) (adapter 1)
+node1: 8088 (guest) => 8089 (host) (adapter 1)
+node1: 9083 (guest) => 9084 (host) (adapter 1)
+node1: 4040 (guest) => 4041 (host) (adapter 1)
+node1: 18888 (guest) => 18889 (host) (adapter 1)
+node1: 16010 (guest) => 16011 (host) (adapter 1)
+node1: 22 (guest) => 2200 (host) (adapter 1)
+==> node1: Running 'pre-boot' VM customizations...
+
+
+==> node1: Checking for guest additions in VM...
+node1: The guest additions on this VM do not match the installed version of
+node1: VirtualBox! In most cases this is fine, but in rare cases it can
+node1: prevent things such as shared folders from working properly. If you see
+node1: shared folder errors, please make sure the guest additions within the
+node1: virtual machine match the version of VirtualBox you have installed on
+node1: your host and reload your VM.
+node1:
+node1: Guest Additions Version: 5.1.38
+node1: VirtualBox Version: 5.2
+
 ------------
 
@@ -43,5 +89,7 @@
 SPARK (Spark SQL): https://github.com/commoncrawl/cc-index-table
 with example on selecting languages
 https://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2019-30/indexes/cluster.idx
+
+./convert_url_index.sh https://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2019-30/indexes/cdx-00000.gz hdfs:///user/vagrant/cc-index-table
 ---
 
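Reading the port-forwarding output recorded in this changeset: each guest port in the log is reachable from the host on the port shown after `=>`. For example, Spark's application web UI listens on guest port 4040 by default, so with this mapping it would be opened on the host at port 4041. The sketch below looks up a host port from the logged lines. The helper name `host_port_for` is hypothetical and not part of Vagrant, and the forwarding lines are copied from the log in this changeset, so they may differ on another machine.

```shell
#!/bin/sh
# Forwarding lines copied from the `vagrant up` output recorded in the changeset.
forwarded_ports='node1: 8080 (guest) => 8081 (host) (adapter 1)
node1: 8088 (guest) => 8089 (host) (adapter 1)
node1: 9083 (guest) => 9084 (host) (adapter 1)
node1: 4040 (guest) => 4041 (host) (adapter 1)
node1: 18888 (guest) => 18889 (host) (adapter 1)
node1: 16010 (guest) => 16011 (host) (adapter 1)
node1: 22 (guest) => 2200 (host) (adapter 1)'

# host_port_for GUEST_PORT (hypothetical helper):
# print the host port that Vagrant forwards to GUEST_PORT, or nothing if unmapped.
host_port_for() {
    printf '%s\n' "$forwarded_ports" |
        awk -v guest="$1" '$2 == guest && $3 == "(guest)" { print $5 }'
}

# Guest port 4040 is the default Spark application UI port.
host_port_for 4040   # prints 4041
```

An unmapped guest port prints nothing, which makes the helper easy to use in a conditional such as `[ -n "$(host_port_for 8088)" ]`.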