To run firefox/anything graphical inside the VM run by vagrant, you have to ssh -Y onto analytics first and then onto the vagrant VM from analytics:
1. ssh analytics -Y
2. [anupama@analytics vagrant-hadoop-hive-spark]$ vagrant ssh -- -Y
or
vagrant ssh -- -Y node1
(the -- flag tells the vagrant command that the subsequent -Y flag should be passed to the ssh cmd that vagrant runs)

Only once ssh-ed with vagrant into the VM whose hostname is "node1" do you have access to node1's assigned IP: 10.211.55.101
- Connecting machines, like analytics, must either access node1 itself or use port forwarding to view the VM's servers on localhost. For example, on analytics the Yarn pages can be viewed at http://localhost:8088/
- If firefox is launched inside the VM (so inside node1), then the services' pages can be accessed on their respective ports at any of localhost|10.211.55.101|node1.
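
To double-check which guest ports end up on which host ports (a sketch; run on analytics from the directory containing the Vagrantfile, and note that listing the forwards this way assumes the VirtualBox provider):
    vagrant port node1          # lists guest => host port mappings
    vagrant ssh-config node1    # shows the HostName/Port/IdentityFile that "vagrant ssh" uses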

-------------------------


WET example from https://github.com/commoncrawl/cc-warc-examples

vagrant@node1:~/cc-warc-examples$ hdfs dfs -mkdir /user/vagrant/data
vagrant@node1:~/cc-warc-examples$ hdfs dfs -put data/CC-MAIN-20190715175205-20190715200159-00000.warc.wet.gz hdfs:///user/vagrant/data/.
vagrant@node1:~/cc-warc-examples$ hdfs dfs -ls data
Found 1 items
-rw-r--r-- 1 vagrant supergroup 154823265 2019-08-19 08:23 data/CC-MAIN-20190715175205-20190715200159-00000.warc.wet.gz
vagrant@node1:~/cc-warc-examples$ hadoop jar target/cc-warc-examples-0.3-SNAPSHOT-jar-with-dependencies.jar org.commoncrawl.examples.mapreduce.WETWordCount

Once finished:

vagrant@node1:~/cc-warc-examples$ hdfs dfs -cat /tmp/cc/part*
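
To pull the word counts out of HDFS for a closer look (a sketch; it assumes the usual word<TAB>count output format under /tmp/cc, and ~/wordcounts.txt is just an example destination):
    hdfs dfs -getmerge /tmp/cc ~/wordcounts.txt
    sort -t$'\t' -k2 -nr ~/wordcounts.txt | head -20    # 20 most frequent tokens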



INFO ON HADOOP/HDFS:
https://www.bluedata.com/blog/2016/08/hadoop-hdfs-upgrades-painful/

SPARK:
configure option example: https://stackoverflow.com/questions/46152526/how-should-i-configure-spark-to-correctly-prune-hive-metastore-partitions



Content-language filter used in the query below: LIKE '%isl%' matches Icelandic, LIKE '%mri%' matches Maori.

cd cc-index-table
APPJAR=target/cc-spark-0.2-SNAPSHOT-jar-with-dependencies.jar
# ($SPARK_ON_YARN was commented out here; insert it after spark-submit to run on the YARN cluster)
$SPARK_HOME/bin/spark-submit \
 --conf spark.hadoop.parquet.enable.dictionary=true \
 --conf spark.hadoop.parquet.enable.summary-metadata=false \
 --conf spark.sql.hive.metastorePartitionPruning=true \
 --conf spark.sql.parquet.filterPushdown=true \
 --conf spark.sql.parquet.mergeSchema=true \
 --class org.commoncrawl.spark.examples.CCIndexWarcExport $APPJAR \
 --query "SELECT url, warc_filename, warc_record_offset, warc_record_length
 FROM ccindex
 WHERE crawl = 'CC-MAIN-2019-30' AND subset = 'warc' AND content_languages LIKE '%mri%'" \
 --numOutputPartitions 12 \
 --numRecordsPerWarcFile 20000 \
 --warcPrefix ICELANDIC-CC-2018-43 \
 s3://commoncrawl/cc-index/table/cc-main/warc/ \
 .../my_output_path/
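
Once that export finishes, the generated WARC files should sit under the output path given on the last line (left as a placeholder above). A minimal check, substituting the real path (sketch):
    hdfs dfs -ls <output_path>    # or the corresponding local/S3 path, depending on where the output was written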


----
TIME
----
1. https://dzone.com/articles/need-billions-of-web-pages-dont-bother-crawling
http://digitalpebble.blogspot.com/2017/03/need-billions-of-web-pages-dont-bother_29.html

"So, not only have CommonCrawl given you loads of web data for free, they’ve also made your life easier by preprocessing the data for you. For many tasks, the content of the WAT or WET files will be sufficient and you won’t have to process the WARC files.

This should not only help you simplify your code but also make the whole processing faster. We recently ran an experiment on CommonCrawl where we needed to extract anchor text from HTML pages. We initially wrote some MapReduce code to extract the binary content of the pages from their WARC representation, processed the HTML with JSoup and reduced on the anchor text. Processing a single WARC segment took roughly 100 minutes on a 10-node EMR cluster. We then simplified the extraction logic, took the WAT files as input and the processing time dropped to 17 minutes on the same cluster. This gain was partly due to not having to parse the web pages, but also to the fact that WAT files are a lot smaller than their WARC counterparts."

2. https://spark-in.me/post/parsing-common-crawl-in-two-simple-commands
"Both the parsing part and the processing part take just a couple of minutes per index file / WET file - the bulk of the “compute” lies within actually downloading these files.

Essentially if you have some time to spare and an unlimited Internet connection, all of this processing can be done on one powerful machine. You can be fancy and go ahead and rent some Amazon server(s) to minimize the download time, but that can be costly.

In my experience - parsing the whole index for Russian websites (just filtering by language) takes approximately 140 hours - but the majority of this time is just downloading (my speed averaged ~300-500 kb/s)."

=========================================================
Configuring Spark to work on Amazon AWS s3a dataset:
=========================================================
https://stackoverflow.com/questions/30385981/how-to-access-s3a-files-from-apache-spark
http://deploymentzone.com/2015/12/20/s3a-on-spark-on-aws-ec2/
https://answers.dataiku.com/1734/common-crawl-s3
https://stackoverflow.com/questions/2354525/what-should-be-hadoop-tmp-dir
https://stackoverflow.com/questions/40169610/where-exactly-should-hadoop-tmp-dir-be-set-core-site-xml-or-hdfs-site-xml

https://stackoverflow.com/questions/43759896/spark-truncated-the-string-representation-of-a-plan-since-it-was-too-large-w


https://sparkour.urizone.net/recipes/using-s3/
Configuring Spark to Use Amazon S3
"Some Spark tutorials show AWS access keys hardcoded into the file paths. This is a horribly insecure approach and should never be done. Use exported environment variables or IAM Roles instead, as described in Configuring Amazon S3 as a Spark Data Source."

"No FileSystem for scheme: s3n

java.io.IOException: No FileSystem for scheme: s3n

This message appears when dependencies are missing from your Apache Spark distribution. If you see this error message, you can use the --packages parameter and Spark will use Maven to locate the missing dependencies and distribute them to the cluster. Alternately, you can use --jars if you manually downloaded the dependencies already. These parameters also works on the spark-submit script."
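
For this particular missing-dependency case, a hedged example of the --packages route (assuming the hadoop-aws version should match the VM's Hadoop 2.7.6 and that the VM can reach Maven Central; the matching aws-java-sdk is pulled in transitively):
    $SPARK_HOME/bin/spark-submit \
        --packages org.apache.hadoop:hadoop-aws:2.7.6 \
        ...rest of the spark-submit arguments as in the examples below...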

===========================================
IAM Role (or user) and commoncrawl profile
===========================================

"iam" role or user for commoncrawl(er) profile


aws management console:
[email protected]
lab pwd, capital R and ! (maybe g)

commoncrawl profile created while creating the user/role, by following the instructions at: https://answers.dataiku.com/1734/common-crawl-s3

<!--
 <property>
   <name>fs.s3a.awsAccessKeyId</name>
   <value>XXX</value>
 </property>
 <property>
   <name>fs.s3a.awsSecretAccessKey</name>
   <value>XXX</value>
 </property>
-->


But instead of putting the access and secret keys in hadoop's core-site.xml as above (with sudo emacs /usr/local/hadoop-2.7.6/etc/hadoop/core-site.xml),

you'll want to put the Amazon AWS access key and secret key in the spark properties file:

    sudo emacs /usr/local/spark-2.3.0-bin-hadoop2.7/conf/spark-defaults.conf


The spark properties should contain:

spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key=ACCESSKEY
spark.hadoop.fs.s3a.secret.key=SECRETKEY
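
A quick way to append those three lines without opening an editor (a sketch; ACCESSKEY and SECRETKEY are placeholders for the real values):
sudo tee -a /usr/local/spark-2.3.0-bin-hadoop2.7/conf/spark-defaults.conf <<'EOF'
spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key=ACCESSKEY
spark.hadoop.fs.s3a.secret.key=SECRETKEY
EOF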



When the job is running, you can visit the Spark Context at http://node1:4040/jobs/ (http://node1:4041/jobs/ for me, since I forwarded the vagrant VM's ports at +1).

-------------

APPJAR=target/cc-spark-0.2-SNAPSHOT-jar-with-dependencies.jar
$SPARK_HOME/bin/spark-submit \
 --conf spark.hadoop.parquet.enable.dictionary=true \
 --conf spark.hadoop.parquet.enable.summary-metadata=false \
 --conf spark.sql.hive.metastorePartitionPruning=true \
 --conf spark.sql.parquet.filterPushdown=true \
 --conf spark.sql.parquet.mergeSchema=true \
 --class org.commoncrawl.spark.examples.CCIndexExport $APPJAR \
 --query "SELECT url, warc_filename, warc_record_offset, warc_record_length
 FROM ccindex
 WHERE crawl = 'CC-MAIN-2019-30' AND subset = 'warc' AND content_languages LIKE '%mri%'" \
 --outputFormat csv \
 --numOutputPartitions 10 \
 --outputCompression gzip \
 s3://commoncrawl/cc-index/table/cc-main/warc/ \
 hdfs:///user/vagrant/cc-mri-csv
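
When this CSV export completes, the result can be checked straight from HDFS (a sketch; "hdfs dfs -text" decompresses the gzipped part files on the fly):
    hdfs dfs -ls hdfs:///user/vagrant/cc-mri-csv
    hdfs dfs -text hdfs:///user/vagrant/cc-mri-csv/part-* | head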

----------------
Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found


https://stackoverflow.com/questions/39355354/spark-no-filesystem-for-scheme-https-cannot-load-files-from-amazon-s3
https://stackoverflow.com/questions/33356041/technically-what-is-the-difference-between-s3n-s3a-and-s3
"2018-01-10 Update Hadoop 3.0 has cut its s3: and s3n implementations: s3a is all you get. It is now significantly better than its predecessor and performs as least as good as the Amazon implementation. Amazon's "s3:" is still offered by EMR, which is their closed source client. Consult the EMR docs for more info."

1. https://stackoverflow.com/questions/30385981/how-to-access-s3a-files-from-apache-spark

"Having experienced first hand the difference between s3a and s3n - 7.9GB of data transferred on s3a was around ~7 minutes while 7.9GB of data on s3n took 73 minutes [us-east-1 to us-west-1 unfortunately in both cases; Redshift and Lambda being us-east-1 at this time] this is a very important piece of the stack to get correct and it's worth the frustration.

Here are the key parts, as of December 2015:

 Your Spark cluster will need a Hadoop version 2.x or greater. If you use the Spark EC2 setup scripts and maybe missed it, the switch for using something other than 1.0 is to specify --hadoop-major-version 2 (which uses CDH 4.2 as of this writing).

 You'll need to include what may at first seem to be an out of date AWS SDK library (built in 2014 as version 1.7.4) for versions of Hadoop as late as 2.7.1 (stable): aws-java-sdk 1.7.4. As far as I can tell using this along with the specific AWS SDK JARs for 1.10.8 hasn't broken anything.

 You'll also need the hadoop-aws 2.7.1 JAR on the classpath. This JAR contains the class org.apache.hadoop.fs.s3a.S3AFileSystem.

 In spark.properties you probably want some settings that look like this:

 spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
 spark.hadoop.fs.s3a.access.key=ACCESSKEY
 spark.hadoop.fs.s3a.secret.key=SECRETKEY

I've detailed this list in more detail on a post I wrote as I worked my way through this process. In addition I've covered all the exception cases I hit along the way and what I believe to be the cause of each and how to fix them."


2. The classpath used by hadoop can be found by running the command (https://stackoverflow.com/questions/26748811/setting-external-jars-to-hadoop-classpath):
    hadoop classpath
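
A quick way to see whether any S3A/AWS jars are already on that classpath (sketch):
    hadoop classpath | tr ':' '\n' | grep -i -e aws -e s3a
If nothing matching hadoop-aws or aws-java-sdk turns up, that would explain the ClassNotFoundException above.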


3. Got the hadoop-aws 2.7.6 jar from https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/2.7.6 and put it into /home/vagrant


4. https://stackoverflow.com/questions/26748811/setting-external-jars-to-hadoop-classpath
https://stackoverflow.com/questions/28520821/how-to-add-external-jar-to-hadoop-job/54459211#54459211
vagrant@node1:~$ export LIBJARS=/home/vagrant/hadoop-aws-2.7.6.jar
vagrant@node1:~$ export HADOOP_CLASSPATH=`echo ${LIBJARS} | sed s/,/:/g`
vagrant@node1:~$ hadoop classpath
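
Note that LIBJARS only takes effect for jobs whose driver parses generic options (ToolRunner/GenericOptionsParser); a sketch of where the flag goes, with the jar and class names as placeholders:
    hadoop jar some-job.jar some.pkg.JobClass -libjars ${LIBJARS} <job args>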

5. https://community.cloudera.com/t5/Community-Articles/HDP-2-4-0-and-Spark-1-6-0-connecting-to-AWS-S3-buckets/ta-p/245760
"Download the aws sdk for java https://aws.amazon.com/sdk-for-java/ Uploaded it to the hadoop directory. You should see the aws-java-sdk-1.10.65.jar in /usr/hdp/2.4.0.0-169/hadoop/"

I got version 1.11 (aws-java-sdk-1.11.616.jar).

[Can't find a spark.properties file, but this seems to contain the Spark-specific properties:
$SPARK_HOME/conf/spark-defaults.conf

https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-properties.html
"The default Spark properties file is $SPARK_HOME/conf/spark-defaults.conf that could be overriden using spark-submit with the --properties-file command-line option."]

Can SUDO COPY the 2 jar files hadoop-aws-2.7.6.jar and aws-java-sdk-1.11.616.jar to:
/usr/local/hadoop/share/hadoop/common/
(else /usr/local/hadoop/share/hadoop/hdfs/hadoop-aws-2.7.6.jar)
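
Spelled out (a sketch; this assumes both jars were downloaded into /home/vagrant as in step 3 above):
    sudo cp /home/vagrant/hadoop-aws-2.7.6.jar /usr/local/hadoop/share/hadoop/common/
    sudo cp /home/vagrant/aws-java-sdk-1.11.616.jar /usr/local/hadoop/share/hadoop/common/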

--------
Schema:
https://commoncrawl.s3.amazonaws.com/cc-index/table/cc-main/index.html

---------------
More examples to try:
https://github.com/commoncrawl/cc-warc-examples


A bit outdated?
https://www.journaldev.com/20342/apache-spark-example-word-count-program-java
https://www.journaldev.com/20261/apache-spark

--------

sudo apt-get install maven
(or sudo apt update
sudo apt install maven)
git clone https://github.com/commoncrawl/cc-index-table.git
cd cc-index-table
mvn package
vagrant@node1:~/cc-index-table$ ./src/script/convert_url_index.sh https://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2019-30/indexes/cdx-00000.gz hdfs:///user/vagrant/cc-index-table
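
If the conversion succeeds, the resulting table files should show up under the HDFS path given as the second argument (sketch):
    hdfs dfs -ls hdfs:///user/vagrant/cc-index-table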




spark:
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-shell.html

============
Dr Bainbridge found the following vagrant file, which will set up hadoop and spark, presumably for cluster computing:

https://github.com/martinprobson/vagrant-hadoop-hive-spark

Vagrant:
 * Guide: https://www.vagrantup.com/intro/getting-started/index.html
 * Common cmds: https://blog.ipswitch.com/5-vagrant-commands-you-need-to-know
 * vagrant reload = vagrant halt + vagrant up https://www.vagrantup.com/docs/cli/reload.html
 * https://stackoverflow.com/questions/46903623/how-to-use-firefox-ui-in-vagrant-box
 * https://stackoverflow.com/questions/22651399/how-to-install-firefox-in-precise64-vagrant-box
   sudo apt-get -y install firefox
 * vagrant install emacs: https://medium.com/@AnnaJS15/getting-started-with-virtualbox-and-vagrant-8d98aa271d2a

 * hadoop conf: sudo vi /usr/local/hadoop-2.7.6/etc/hadoop/mapred-site.xml
 * https://data-flair.training/forums/topic/mkdir-cannot-create-directory-data-name-node-is-in-safe-mode/ (name node stuck in safe mode; fix sketched below)
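
For the safe-mode problem in the last link (HDFS refusing writes with "Name node is in safe mode", typically right after a restart), the usual manual fix, if it does not leave safe mode by itself, is (sketch):
    hdfs dfsadmin -safemode get      # check the current state
    hdfs dfsadmin -safemode leave    # force the name node out of safe mode
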
---
==> node1: Forwarding ports...
    node1: 8080 (guest) => 8081 (host) (adapter 1)
    node1: 8088 (guest) => 8089 (host) (adapter 1)
    node1: 9083 (guest) => 9084 (host) (adapter 1)
    node1: 4040 (guest) => 4041 (host) (adapter 1)
    node1: 18888 (guest) => 18889 (host) (adapter 1)
    node1: 16010 (guest) => 16011 (host) (adapter 1)
    node1: 22 (guest) => 2200 (host) (adapter 1)
==> node1: Running 'pre-boot' VM customizations...


==> node1: Checking for guest additions in VM...
    node1: The guest additions on this VM do not match the installed version of
    node1: VirtualBox! In most cases this is fine, but in rare cases it can
    node1: prevent things such as shared folders from working properly. If you see
    node1: shared folder errors, please make sure the guest additions within the
    node1: virtual machine match the version of VirtualBox you have installed on
    node1: your host and reload your VM.
    node1:
    node1: Guest Additions Version: 5.1.38
    node1: VirtualBox Version: 5.2

------------