source: gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt@ 33440

Last change on this file since 33440 was 33440, checked in by ak19, 5 years ago

Split file to move vagrant-spark-hadoop notes into own file.

File size: 9.1 KB
To run firefox/anything graphical inside the VM run by vagrant, you have to ssh -Y twice: first onto analytics, then from analytics onto the vagrant VM:
1. ssh analytics -Y
2. [anupama@analytics vagrant-hadoop-hive-spark]$ vagrant ssh -- -Y
or
vagrant ssh -- -Y node1
(the -- flag tells the vagrant command that the subsequent -Y flag should be passed to the ssh cmd that vagrant runs)

Only once ssh-ed with vagrant into the VM whose hostname is "node1" do you have access to node1's assigned IP: 10.211.55.101
- Machines connecting from outside, like analytics, must either reach node1 by that IP or use port forwarding to view the VM's servers on localhost. For example, on analytics the Yarn pages can be viewed at http://localhost:8088/
- If firefox is launched inside the VM (so inside node1), then pages can be accessed off their respective ports at any of localhost|10.211.55.101|node1.
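A related trick (just a sketch, assuming a standard OpenSSH client on the local desktop and that the Yarn UI really is on analytics' localhost:8088 as above): instead of running firefox remotely over -Y, tunnel the port to the local machine and browse locally:
    ssh -L 8088:localhost:8088 analytics
    (then open http://localhost:8088/ on the local machine; adjust the port if vagrant forwarded Yarn elsewhere)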



WET example from https://github.com/commoncrawl/cc-warc-examples

vagrant@node1:~/cc-warc-examples$ hdfs dfs -mkdir /user/vagrant/data
vagrant@node1:~/cc-warc-examples$ hdfs dfs -put data/CC-MAIN-20190715175205-20190715200159-00000.warc.wet.gz hdfs:///user/vagrant/data/.
vagrant@node1:~/cc-warc-examples$ hdfs dfs -ls data
Found 1 items
-rw-r--r-- 1 vagrant supergroup 154823265 2019-08-19 08:23 data/CC-MAIN-20190715175205-20190715200159-00000.warc.wet.gz
vagrant@node1:~/cc-warc-examples$ hadoop jar target/cc-warc-examples-0.3-SNAPSHOT-jar-with-dependencies.jar org.commoncrawl.examples.mapreduce.WETWordCount

<ONCE FINISHED:>

vagrant@node1:~/cc-warc-examples$ hdfs dfs -cat /tmp/cc/part*
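If the WETWordCount job needs re-running, the existing output directory has to be cleared first, since MapReduce refuses to write into an output path that already exists (the /tmp/cc path is inferred from the cat command above):
    vagrant@node1:~/cc-warc-examples$ hdfs dfs -rm -r /tmp/cc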



INFO ON HADOOP/HDFS:
https://www.bluedata.com/blog/2016/08/hadoop-hdfs-upgrades-painful/

SPARK:
configure option example: https://stackoverflow.com/questions/46152526/how-should-i-configure-spark-to-correctly-prune-hive-metastore-partitions


LIKE '%isl%'  (the SQL LIKE pattern on content_languages used for Icelandic, "isl"; for Maori the code is "mri", as in the query below)

cd cc-index-table
APPJAR=target/cc-spark-0.2-SNAPSHOT-jar-with-dependencies.jar
# add $SPARK_ON_YARN after spark-submit to run this on the Yarn cluster
$SPARK_HOME/bin/spark-submit \
 --conf spark.hadoop.parquet.enable.dictionary=true \
 --conf spark.hadoop.parquet.enable.summary-metadata=false \
 --conf spark.sql.hive.metastorePartitionPruning=true \
 --conf spark.sql.parquet.filterPushdown=true \
 --conf spark.sql.parquet.mergeSchema=true \
 --class org.commoncrawl.spark.examples.CCIndexWarcExport $APPJAR \
 --query "SELECT url, warc_filename, warc_record_offset, warc_record_length
  FROM ccindex
  WHERE crawl = 'CC-MAIN-2019-30' AND subset = 'warc' AND content_languages LIKE '%mri%'" \
 --numOutputPartitions 12 \
 --numRecordsPerWarcFile 20000 \
 --warcPrefix ICELANDIC-CC-2018-43 \
 s3://commoncrawl/cc-index/table/cc-main/warc/ \
 .../my_output_path/
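If the spark-submit above later fails with the S3AFileSystem ClassNotFoundException shown below, one workaround sketch (reusing the s3a property names quoted further down and the jars fetched in steps 3-5; the exact jar locations are an assumption) is to pass the jars and s3a settings straight on the command line:
 $SPARK_HOME/bin/spark-submit \
 --jars /home/vagrant/hadoop-aws-2.7.6.jar,/home/vagrant/aws-java-sdk-1.11.616.jar \
 --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
 --conf spark.hadoop.fs.s3a.access.key=ACCESSKEY \
 --conf spark.hadoop.fs.s3a.secret.key=SECRETKEY \
 ... (rest of the arguments as above)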


----------------
Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found


https://stackoverflow.com/questions/39355354/spark-no-filesystem-for-scheme-https-cannot-load-files-from-amazon-s3
https://stackoverflow.com/questions/33356041/technically-what-is-the-difference-between-s3n-s3a-and-s3
"2018-01-10 Update Hadoop 3.0 has cut its s3: and s3n implementations: s3a is all you get. It is now significantly better than its predecessor and performs as least as good as the Amazon implementation. Amazon's "s3:" is still offered by EMR, which is their closed source client. Consult the EMR docs for more info."

1. https://stackoverflow.com/questions/30385981/how-to-access-s3a-files-from-apache-spark

"Having experienced first hand the difference between s3a and s3n - 7.9GB of data transferred on s3a was around ~7 minutes while 7.9GB of data on s3n took 73 minutes [us-east-1 to us-west-1 unfortunately in both cases; Redshift and Lambda being us-east-1 at this time] this is a very important piece of the stack to get correct and it's worth the frustration.

Here are the key parts, as of December 2015:

    Your Spark cluster will need a Hadoop version 2.x or greater. If you use the Spark EC2 setup scripts and maybe missed it, the switch for using something other than 1.0 is to specify --hadoop-major-version 2 (which uses CDH 4.2 as of this writing).

    You'll need to include what may at first seem to be an out of date AWS SDK library (built in 2014 as version 1.7.4) for versions of Hadoop as late as 2.7.1 (stable): aws-java-sdk 1.7.4. As far as I can tell using this along with the specific AWS SDK JARs for 1.10.8 hasn't broken anything.

    You'll also need the hadoop-aws 2.7.1 JAR on the classpath. This JAR contains the class org.apache.hadoop.fs.s3a.S3AFileSystem.

    In spark.properties you probably want some settings that look like this:

    spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
    spark.hadoop.fs.s3a.access.key=ACCESSKEY
    spark.hadoop.fs.s3a.secret.key=SECRETKEY

I've detailed this list in more detail on a post I wrote as I worked my way through this process. In addition I've covered all the exception cases I hit along the way and what I believe to be the cause of each and how to fix them."

2. The classpath used by hadoop can be found by running the command (https://stackoverflow.com/questions/26748811/setting-external-jars-to-hadoop-classpath):
    hadoop classpath


3. Got hadoop-aws 2.7.6 jar
from https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/2.7.6
and put it into /home/vagrant
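(One way to fetch it from inside the VM - a sketch; the Maven Central download path is assumed from the artifact coordinates, not copied from anywhere in these notes:
    cd /home/vagrant
    wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.6/hadoop-aws-2.7.6.jar )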


4. https://stackoverflow.com/questions/26748811/setting-external-jars-to-hadoop-classpath
https://stackoverflow.com/questions/28520821/how-to-add-external-jar-to-hadoop-job/54459211#54459211
vagrant@node1:~$ export LIBJARS=/home/vagrant/hadoop-aws-2.7.6.jar
vagrant@node1:~$ export HADOOP_CLASSPATH=`echo ${LIBJARS} | sed s/,/:/g`
vagrant@node1:~$ hadoop classpath
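A quick check that the aws jar actually shows up on the classpath (plain shell, nothing project-specific assumed):
    vagrant@node1:~$ hadoop classpath | tr ':' '\n' | grep aws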

5. https://community.cloudera.com/t5/Community-Articles/HDP-2-4-0-and-Spark-1-6-0-connecting-to-AWS-S3-buckets/ta-p/245760
"Download the aws sdk for java https://aws.amazon.com/sdk-for-java/ Uploaded it to the hadoop directory. You should see the aws-java-sdk-1.10.65.jar in /usr/hdp/2.4.0.0-169/hadoop/"

I got version 1.11

[Can't find a spark.properties file, but this seems to contain spark specific properties:
$SPARK_HOME/conf/spark-defaults.conf

https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-properties.html
"The default Spark properties file is $SPARK_HOME/conf/spark-defaults.conf that could be overriden using spark-submit with the --properties-file command-line option."]

Can SUDO COPY the 2 jar files hadoop-aws-2.7.6.jar and aws-java-sdk-1.11.616.jar to:
/usr/local/hadoop/share/hadoop/common/
(else /usr/local/hadoop/share/hadoop/hdfs/hadoop-aws-2.7.6.jar)
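Tying the two notes above together, a hedged sketch of what could go into $SPARK_HOME/conf/spark-defaults.conf so Spark picks up s3a and the two copied jars (the property names come from the quotes above; the extraClassPath values assume the jars really were copied into hadoop/common, which hasn't been verified here):
    spark.hadoop.fs.s3a.impl        org.apache.hadoop.fs.s3a.S3AFileSystem
    spark.hadoop.fs.s3a.access.key  ACCESSKEY
    spark.hadoop.fs.s3a.secret.key  SECRETKEY
    spark.driver.extraClassPath     /usr/local/hadoop/share/hadoop/common/hadoop-aws-2.7.6.jar:/usr/local/hadoop/share/hadoop/common/aws-java-sdk-1.11.616.jar
    spark.executor.extraClassPath   /usr/local/hadoop/share/hadoop/common/hadoop-aws-2.7.6.jar:/usr/local/hadoop/share/hadoop/common/aws-java-sdk-1.11.616.jar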

--------
schema of the cc-index table:
https://commoncrawl.s3.amazonaws.com/cc-index/table/cc-main/index.html

---------------
More examples to try:
https://github.com/commoncrawl/cc-warc-examples


A bit outdated?
https://www.journaldev.com/20342/apache-spark-example-word-count-program-java
https://www.journaldev.com/20261/apache-spark

--------

sudo apt-get install maven
(or: sudo apt update
     sudo apt install maven)
git clone https://github.com/commoncrawl/cc-index-table.git
cd cc-index-table
mvn package
vagrant@node1:~/cc-index-table$ ./src/script/convert_url_index.sh https://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2019-30/indexes/cdx-00000.gz hdfs:///user/vagrant/cc-index-table
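To confirm the conversion wrote something into HDFS (plain hdfs commands; the layout under the target dir isn't assumed beyond the path given to the script above):
    vagrant@node1:~/cc-index-table$ hdfs dfs -ls hdfs:///user/vagrant/cc-index-table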




spark:
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-shell.html
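A minimal way to poke at Spark interactively inside the VM (a sketch, assuming $SPARK_HOME is set as in the spark-submit example above; drop --master yarn to run purely locally):
    vagrant@node1:~$ $SPARK_HOME/bin/spark-shell --master yarn
    (:quit exits the shell)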

============
Dr Bainbridge found the following Vagrantfile setup that will install hadoop and spark, presumably for cluster computing:

https://github.com/martinprobson/vagrant-hadoop-hive-spark

Vagrant:
 * Guide: https://www.vagrantup.com/intro/getting-started/index.html
 * Common cmds: https://blog.ipswitch.com/5-vagrant-commands-you-need-to-know
 * vagrant reload = vagrant halt + vagrant up https://www.vagrantup.com/docs/cli/reload.html
 * https://stackoverflow.com/questions/46903623/how-to-use-firefox-ui-in-vagrant-box
 * https://stackoverflow.com/questions/22651399/how-to-install-firefox-in-precise64-vagrant-box
       sudo apt-get -y install firefox
 * vagrant install emacs: https://medium.com/@AnnaJS15/getting-started-with-virtualbox-and-vagrant-8d98aa271d2a

 * hadoop conf: sudo vi /usr/local/hadoop-2.7.6/etc/hadoop/mapred-site.xml
 * name node stuck in safe mode: https://data-flair.training/forums/topic/mkdir-cannot-create-directory-data-name-node-is-in-safe-mode/ (the way out is sketched below)
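If HDFS refuses writes because the name node is in safe mode (the error in the link above), the standard dfsadmin commands to check and leave it are:
    vagrant@node1:~$ hdfs dfsadmin -safemode get
    vagrant@node1:~$ hdfs dfsadmin -safemode leave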
---
==> node1: Forwarding ports...
    node1: 8080 (guest) => 8081 (host) (adapter 1)
    node1: 8088 (guest) => 8089 (host) (adapter 1)
    node1: 9083 (guest) => 9084 (host) (adapter 1)
    node1: 4040 (guest) => 4041 (host) (adapter 1)
    node1: 18888 (guest) => 18889 (host) (adapter 1)
    node1: 16010 (guest) => 16011 (host) (adapter 1)
    node1: 22 (guest) => 2200 (host) (adapter 1)
==> node1: Running 'pre-boot' VM customizations...


==> node1: Checking for guest additions in VM...
    node1: The guest additions on this VM do not match the installed version of
    node1: VirtualBox! In most cases this is fine, but in rare cases it can
    node1: prevent things such as shared folders from working properly. If you see
    node1: shared folder errors, please make sure the guest additions within the
    node1: virtual machine match the version of VirtualBox you have installed on
    node1: your host and reload your VM.
    node1:
    node1: Guest Additions Version: 5.1.38
    node1: VirtualBox Version: 5.2

------------