To run firefox or anything else graphical inside the VM run by vagrant, you have to ssh -Y first onto analytics and then from analytics onto the vagrant VM:
1. ssh analytics -Y
2. [anupama@analytics vagrant-hadoop-hive-spark]$ vagrant ssh -- -Y
or
vagrant ssh -- -Y node1
(the -- flag tells the vagrant command that the subsequent -Y flag should be passed to the ssh command that vagrant runs)

Only once you have ssh-ed with vagrant into the VM whose hostname is "node1" do you have access to node1's assigned IP: 10.211.55.101
- Connecting machines, like analytics, must either reach node1 directly or use port forwarding to view the VM's servers on localhost (see the tunnel sketch below). For example, on analytics the Yarn pages can be viewed at http://localhost:8088/
- If firefox is launched inside the VM (so inside node1), then pages can be accessed on their respective ports at any of localhost|10.211.55.101|node1.
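
An alternative to X-forwarding firefox is to tunnel a web UI port back to your own desktop. A minimal sketch, assuming OpenSSH on the desktop and that analytics can reach node1 on 10.211.55.101 (ports other than Yarn's 8088 work the same way):

   # forward local port 8088, via analytics, to node1's Yarn ResourceManager UI
   ssh -L 8088:10.211.55.101:8088 analytics
   # then browse http://localhost:8088/ on the desktop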

-------------------------


WET example from https://github.com/commoncrawl/cc-warc-examples

vagrant@node1:~/cc-warc-examples$ hdfs dfs -mkdir /user/vagrant/data
vagrant@node1:~/cc-warc-examples$ hdfs dfs -put data/CC-MAIN-20190715175205-20190715200159-00000.warc.wet.gz hdfs:///user/vagrant/data/.
vagrant@node1:~/cc-warc-examples$ hdfs dfs -ls data
Found 1 items
-rw-r--r-- 1 vagrant supergroup 154823265 2019-08-19 08:23 data/CC-MAIN-20190715175205-20190715200159-00000.warc.wet.gz
vagrant@node1:~/cc-warc-examples$ hadoop jar target/cc-warc-examples-0.3-SNAPSHOT-jar-with-dependencies.jar org.commoncrawl.examples.mapreduce.WETWordCount
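
While that MapReduce job runs, its progress can be followed on the Yarn web UI (port 8088, as above) or from the command line. A small sketch; the application id shown is hypothetical, and pulling logs assumes log aggregation is enabled:

   # list YARN applications and their state/progress
   yarn application -list
   # once finished, logs can be fetched by application id
   yarn logs -applicationId application_1566166750370_0001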

Once finished:

vagrant@node1:~/cc-warc-examples$ hdfs dfs -cat /tmp/cc/part*
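
To keep a local copy of the word counts, the part files can be pulled out of HDFS; a minimal sketch, assuming the job wrote its output under /tmp/cc as above:

   # merge all part files into a single local file and peek at it
   hdfs dfs -getmerge /tmp/cc wordcount.txt
   head wordcount.txt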



INFO ON HADOOP/HDFS:
https://www.bluedata.com/blog/2016/08/hadoop-hdfs-upgrades-painful/

SPARK:
configure option example: https://stackoverflow.com/questions/46152526/how-should-i-configure-spark-to-correctly-prune-hive-metastore-partitions



(For Icelandic the content_languages filter would be LIKE '%isl%'; the example below filters for Maori, '%mri%'.)

cd cc-index-table
APPJAR=target/cc-spark-0.2-SNAPSHOT-jar-with-dependencies.jar
# optionally insert $SPARK_ON_YARN after spark-submit below to run on the YARN cluster
$SPARK_HOME/bin/spark-submit \
    --conf spark.hadoop.parquet.enable.dictionary=true \
    --conf spark.hadoop.parquet.enable.summary-metadata=false \
    --conf spark.sql.hive.metastorePartitionPruning=true \
    --conf spark.sql.parquet.filterPushdown=true \
    --conf spark.sql.parquet.mergeSchema=true \
    --class org.commoncrawl.spark.examples.CCIndexWarcExport $APPJAR \
    --query "SELECT url, warc_filename, warc_record_offset, warc_record_length
             FROM ccindex
             WHERE crawl = 'CC-MAIN-2019-30' AND subset = 'warc' AND content_languages LIKE '%mri%'" \
    --numOutputPartitions 12 \
    --numRecordsPerWarcFile 20000 \
    --warcPrefix ICELANDIC-CC-2018-43 \
    s3://commoncrawl/cc-index/table/cc-main/warc/ \
    .../my_output_path/
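
The $SPARK_ON_YARN variable isn't defined anywhere in these notes. A guess at what it might hold, using standard spark-submit options (an assumption, not confirmed here):

   # hypothetical: submit to YARN instead of running Spark locally
   SPARK_ON_YARN="--master yarn --deploy-mode cluster"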


=========================================================
Configuring spark to work on Amazon AWS s3a dataset:
=========================================================
https://stackoverflow.com/questions/30385981/how-to-access-s3a-files-from-apache-spark
http://deploymentzone.com/2015/12/20/s3a-on-spark-on-aws-ec2/
https://answers.dataiku.com/1734/common-crawl-s3
https://stackoverflow.com/questions/2354525/what-should-be-hadoop-tmp-dir
https://stackoverflow.com/questions/40169610/where-exactly-should-hadoop-tmp-dir-be-set-core-site-xml-or-hdfs-site-xml

https://stackoverflow.com/questions/43759896/spark-truncated-the-string-representation-of-a-plan-since-it-was-too-large-w

===========================================
IAM Role (or user) and commoncrawl profile
===========================================

"iam" role or user for the commoncrawl(er) profile


AWS management console:
[email protected]
lab pwd, capital R and ! (maybe g)

The commoncrawl profile was created while creating the user/role, by following the instructions at: https://answers.dataiku.com/1734/common-crawl-s3

<!--
  <property>
    <name>fs.s3a.awsAccessKeyId</name>
    <value>XXX</value>
  </property>
  <property>
    <name>fs.s3a.awsSecretAccessKey</name>
    <value>XXX</value>
  </property>
-->


But instead of putting the access and secret keys in hadoop's core-site.xml as above (with sudo emacs /usr/local/hadoop-2.7.6/etc/hadoop/core-site.xml), you'll want to put the Amazon AWS access key and secret key in the spark properties file:

    sudo emacs /usr/local/spark-2.3.0-bin-hadoop2.7/conf/spark-defaults.conf


The spark properties should contain:

spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key=ACCESSKEY
spark.hadoop.fs.s3a.secret.key=SECRETKEY
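
The same three settings can also be passed per job on the spark-submit command line instead of living in spark-defaults.conf; a minimal sketch (ACCESSKEY/SECRETKEY are placeholders):

   $SPARK_HOME/bin/spark-submit \
       --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
       --conf spark.hadoop.fs.s3a.access.key=ACCESSKEY \
       --conf spark.hadoop.fs.s3a.secret.key=SECRETKEY \
       ...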


-------------

APPJAR=target/cc-spark-0.2-SNAPSHOT-jar-with-dependencies.jar
$SPARK_HOME/bin/spark-submit \
    --conf spark.hadoop.parquet.enable.dictionary=true \
    --conf spark.hadoop.parquet.enable.summary-metadata=false \
    --conf spark.sql.hive.metastorePartitionPruning=true \
    --conf spark.sql.parquet.filterPushdown=true \
    --conf spark.sql.parquet.mergeSchema=true \
    --class org.commoncrawl.spark.examples.CCIndexExport $APPJAR \
    --query "SELECT url, warc_filename, warc_record_offset, warc_record_length
             FROM ccindex
             WHERE crawl = 'CC-MAIN-2019-30' AND subset = 'warc' AND content_languages LIKE '%mri%'" \
    --outputFormat csv \
    --numOutputPartitions 10 \
    --outputCompression gzip \
    s3://commoncrawl/cc-index/table/cc-main/warc/ \
    hdfs:///user/vagrant/cc-mri-csv

----------------
Running the above job failed with:

Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found


https://stackoverflow.com/questions/39355354/spark-no-filesystem-for-scheme-https-cannot-load-files-from-amazon-s3
https://stackoverflow.com/questions/33356041/technically-what-is-the-difference-between-s3n-s3a-and-s3
"2018-01-10 Update: Hadoop 3.0 has cut its s3: and s3n implementations: s3a is all you get. It is now significantly better than its predecessor and performs at least as good as the Amazon implementation. Amazon's "s3:" is still offered by EMR, which is their closed source client. Consult the EMR docs for more info."

1. https://stackoverflow.com/questions/30385981/how-to-access-s3a-files-from-apache-spark

"Having experienced first hand the difference between s3a and s3n - 7.9GB of data transferred on s3a was around ~7 minutes while 7.9GB of data on s3n took 73 minutes [us-east-1 to us-west-1 unfortunately in both cases; Redshift and Lambda being us-east-1 at this time] this is a very important piece of the stack to get correct and it's worth the frustration.

Here are the key parts, as of December 2015:

    Your Spark cluster will need a Hadoop version 2.x or greater. If you use the Spark EC2 setup scripts and maybe missed it, the switch for using something other than 1.0 is to specify --hadoop-major-version 2 (which uses CDH 4.2 as of this writing).

    You'll need to include what may at first seem to be an out of date AWS SDK library (built in 2014 as version 1.7.4) for versions of Hadoop as late as 2.7.1 (stable): aws-java-sdk 1.7.4. As far as I can tell using this along with the specific AWS SDK JARs for 1.10.8 hasn't broken anything.

    You'll also need the hadoop-aws 2.7.1 JAR on the classpath. This JAR contains the class org.apache.hadoop.fs.s3a.S3AFileSystem.

    In spark.properties you probably want some settings that look like this:

        spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
        spark.hadoop.fs.s3a.access.key=ACCESSKEY
        spark.hadoop.fs.s3a.secret.key=SECRETKEY

I've detailed this list in more detail on a post I wrote as I worked my way through this process. In addition I've covered all the exception cases I hit along the way and what I believe to be the cause of each and how to fix them."


2. The classpath used by hadoop can be found by running the command (https://stackoverflow.com/questions/26748811/setting-external-jars-to-hadoop-classpath):
    hadoop classpath


3. Got the hadoop-aws 2.7.6 jar from https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/2.7.6
and put it into /home/vagrant
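
One way to fetch that jar onto the VM (the direct URL is an assumption based on Maven Central's standard layout, not taken from these notes):

   cd /home/vagrant
   wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.6/hadoop-aws-2.7.6.jar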


4. https://stackoverflow.com/questions/26748811/setting-external-jars-to-hadoop-classpath
https://stackoverflow.com/questions/28520821/how-to-add-external-jar-to-hadoop-job/54459211#54459211
vagrant@node1:~$ export LIBJARS=/home/vagrant/hadoop-aws-2.7.6.jar
vagrant@node1:~$ export HADOOP_CLASSPATH=`echo ${LIBJARS} | sed s/,/:/g`
vagrant@node1:~$ hadoop classpath

5. https://community.cloudera.com/t5/Community-Articles/HDP-2-4-0-and-Spark-1-6-0-connecting-to-AWS-S3-buckets/ta-p/245760
"Download the aws sdk for java https://aws.amazon.com/sdk-for-java/ Uploaded it to the hadoop directory. You should see the aws-java-sdk-1.10.65.jar in /usr/hdp/2.4.0.0-169/hadoop/"

I got version 1.11 (aws-java-sdk-1.11.616.jar) instead.

[Can't find a spark.properties file, but this seems to contain spark-specific properties:
$SPARK_HOME/conf/spark-defaults.conf

https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-properties.html
"The default Spark properties file is $SPARK_HOME/conf/spark-defaults.conf that could be overriden using spark-submit with the --properties-file command-line option."]

The two jar files hadoop-aws-2.7.6.jar and aws-java-sdk-1.11.616.jar can be sudo-copied to:
/usr/local/hadoop/share/hadoop/common/
(else /usr/local/hadoop/share/hadoop/hdfs/hadoop-aws-2.7.6.jar)
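
A quick check that hadoop can now actually find the AWS jars added above (just filtering the classpath output):

   # print each classpath entry on its own line and look for the aws jars
   hadoop classpath | tr ':' '\n' | grep -i aws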

--------
Schema:
https://commoncrawl.s3.amazonaws.com/cc-index/table/cc-main/index.html

---------------
More examples to try:
https://github.com/commoncrawl/cc-warc-examples


A bit outdated?
https://www.journaldev.com/20342/apache-spark-example-word-count-program-java
https://www.journaldev.com/20261/apache-spark

--------

sudo apt-get install maven
(or: sudo apt update
     sudo apt install maven)
git clone https://github.com/commoncrawl/cc-index-table.git
cd cc-index-table
mvn package
vagrant@node1:~/cc-index-table$ ./src/script/convert_url_index.sh https://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2019-30/indexes/cdx-00000.gz hdfs:///user/vagrant/cc-index-table
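
Once the conversion script finishes, the resulting table files should be visible in HDFS under the output path given above; a minimal check:

   hdfs dfs -ls hdfs:///user/vagrant/cc-index-table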




spark:
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-shell.html

============
Dr Bainbridge found the following Vagrantfile, which will set up hadoop and spark, presumably for cluster computing:

https://github.com/martinprobson/vagrant-hadoop-hive-spark

Vagrant:
 * Guide: https://www.vagrantup.com/intro/getting-started/index.html
 * Common cmds: https://blog.ipswitch.com/5-vagrant-commands-you-need-to-know
 * vagrant reload = vagrant halt + vagrant up: https://www.vagrantup.com/docs/cli/reload.html
 * https://stackoverflow.com/questions/46903623/how-to-use-firefox-ui-in-vagrant-box
 * https://stackoverflow.com/questions/22651399/how-to-install-firefox-in-precise64-vagrant-box
   sudo apt-get -y install firefox
 * vagrant install emacs: https://medium.com/@AnnaJS15/getting-started-with-virtualbox-and-vagrant-8d98aa271d2a

 * hadoop conf: sudo vi /usr/local/hadoop-2.7.6/etc/hadoop/mapred-site.xml
 * https://data-flair.training/forums/topic/mkdir-cannot-create-directory-data-name-node-is-in-safe-mode/
---
==> node1: Forwarding ports...
    node1: 8080 (guest) => 8081 (host) (adapter 1)
    node1: 8088 (guest) => 8089 (host) (adapter 1)
    node1: 9083 (guest) => 9084 (host) (adapter 1)
    node1: 4040 (guest) => 4041 (host) (adapter 1)
    node1: 18888 (guest) => 18889 (host) (adapter 1)
    node1: 16010 (guest) => 16011 (host) (adapter 1)
    node1: 22 (guest) => 2200 (host) (adapter 1)
==> node1: Running 'pre-boot' VM customizations...


==> node1: Checking for guest additions in VM...
    node1: The guest additions on this VM do not match the installed version of
    node1: VirtualBox! In most cases this is fine, but in rare cases it can
    node1: prevent things such as shared folders from working properly. If you see
    node1: shared folder errors, please make sure the guest additions within the
    node1: virtual machine match the version of VirtualBox you have installed on
    node1: your host and reload your VM.
    node1:
    node1: Guest Additions Version: 5.1.38
    node1: VirtualBox Version: 5.2

------------