Vagrant-Spark-Hadoop.txt@ 33675

Last change on this file since 33675 was 33545, checked in by ak19, 5 years ago

Mainly changes to crawling-Nutch.txt and some minor changes to other txt files. crawling-Nutch.txt now documents my attempts to successfully run nutch v2 on the davidb homepage site and crawl it entirely and dump the text output into the local or hadoop filesystem. I also ran 2 different numbers of nutch cycles (generate-fetch-parse-updatedb) to download the site: 10 cycles and 15 cycles. I paid attention to the output the second time, it stopped after 6 cycles saying there was nothing new to fetch. So it seems to have a built-in termination test, allowing site mirroring. Running readdb with the -stats flag allowed me to check that both times, it downloaded 44 URLs.

1	Hadoop/Map-reduce
2
3	https://www.guru99.com/create-your-first-hadoop-program.html
4
5	Some Hadoop commands
6	* https://community.cloudera.com/t5/Support-Questions/Closed-How-to-store-output-of-shell-script-in-HDFS/td-p/229933
7	* https://stackoverflow.com/questions/26513861/checking-if-directory-in-hdfs-already-exists-or-not
8	--------------
To run firefox (or anything graphical) inside the VM run by vagrant, ssh with -Y first onto analytics and then from analytics onto the vagrant VM:
1. ssh analytics -Y
2. [anupama@analytics vagrant-hadoop-hive-spark]$ vagrant ssh -- -Y
or
vagrant ssh -- -Y node1
(the -- tells the vagrant command that the subsequent -Y flag should be passed to the ssh cmd that vagrant runs)

Only once ssh-ed with vagrant into the VM whose hostname is "node1" do you have access to node1's assigned IP: 10.211.55.101
- Connecting machines, like analytics, must access node1 or use port forwarding to view the VM's servers on localhost. For example, on analytics, can view Yarn pages at http://localhost:8088/
- If firefox is launched inside the VM (so inside node1), then can access pages off their respective ports at any of localhost|10.211.55.101|node1.
19
20	===========================================
21	WARC TO WET
22	===========================================
23	https://groups.google.com/forum/#!topic/common-crawl/hsb90GHq6to
24
25	Sebastian Nagel
26	05/07/2017
27	Hi,
28
29	unfortunately, we do not provide (yet) text extracts of the CC-NEWS archives.
30
31	But it's easy to run the WET extractor on the WARC files, see:
32	https://groups.google.com/d/topic/common-crawl/imv4hlLob4s/discussion
33	https://groups.google.com/d/topic/common-crawl/b6yDG7EmnhM/discussion
34
35	That's what you have to do:
36
# download the WARC files and place them in a directory "warc/"
# create sibling folders wat and wet
# |
# |-- warc/
# |   |-- CC-NEWS-20161001224340-00008.warc.gz
# |   |-- CC-NEWS-20161017145313-00000.warc.gz
# |   `-- ...
# |
# |-- wat/
# |
# `-- wet/
48
49	git clone https://github.com/commoncrawl/ia-web-commons
50	cd ia-web-commons
51	mvn install
52
53	cd ..
54	git clone https://github.com/commoncrawl/ia-hadoop-tools
55	cd ia-hadoop-tools
56	mvn package
57
58	java -jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator \
59	-strictMode -skipExisting batch-id-xyz .../warc/*.warc.gz
60
61	The folders wat/ and wet/ will then contain the exports.
62
63	Best,
64	Sebastian
65
66	---
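
A minimal sketch of the folder layout described in the message above, for trying it out on the local filesystem. The function name make_cc_layout and the MKDIR_CMD override are made up for illustration; they are not part of ia-hadoop-tools.

```shell
# Sketch only: create the warc/ wat/ wet/ sibling folders under a base folder.
# MKDIR_CMD is a hypothetical override (default: local "mkdir -p").
make_cc_layout() (
    base="$1"
    mkdir_cmd="${MKDIR_CMD:-mkdir -p}"
    for sub in warc wat wet; do
        # mkdir_cmd is left unquoted on purpose so a multi-word command splits into words
        $mkdir_cmd "$base/$sub" || exit 1
    done
)
```

To create the same layout in HDFS instead, something like MKDIR_CMD="hdfs dfs -mkdir -p" make_cc_layout /user/vagrant/cc-mri-subset should work, assuming the hdfs command is on the PATH.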
67
68	1. So following the above instructions, I first made a warc subfolder in hdfs:///user/vagrant/cc-mri-subset/
69	Then moved all the downloaded *warc.gz into there.
70	Then created wat and wet subfolders in there alongside the warc folder.
71
72	2. Next, I did the 2 git clone and mvn compile operations above.
73	The first, ia-web-commons, successfully compiled (despite some test failures)
74
75	3. BUT THE 2ND GIT PROJECT, ia-hadoop-tools, DIDN'T COMPILE AT FIRST, with the mvn package command failing:
76
77	git clone https://github.com/commoncrawl/ia-hadoop-tools
78	cd ia-hadoop-tools
79	mvn package
80
Compile failed with a message about the JSONTokener constructor not taking a String object. It turned out that the version of the JSONTokener class being used was too old; the necessary constructor is present in the most recent version, as seen in the API at https://docs.oracle.com/cd/E51273_03/ToolsAndFrameworks.110/apidoc/assembler/com/endeca/serialization/json/JSONTokener.html

So I opened up ia-hadoop-tools/pom.xml for editing and added the newest version of the org.json package's json artifact that is listed at http://builds.archive.org/maven2/org/json/json/ (see there for <version>) into the pom.xml's <dependencies> element, based on how this was done at https://bukkit.org/threads/problem-loading-libraries-with-maven-java-lang-noclassdeffounderror-org-json-jsonobject.428354/:
84
85	<dependency>
86	<groupId>org.json</groupId>
87	<artifactId>json</artifactId>
88	<version>20131018</version>
89	</dependency>
90
91	Then I was able to run "mvn package" successfully.
92	(Maybe I could also have added in a far more recent version, as seen in the version numbers at https://mvnrepository.com/artifact/org.json/json,
93	but didn't want to go too far ahead in case there was other incompatibility.)
94
95	4. Next, I wanted to finally run the built executable to convert the warc files to wet files.
96
97	I had the warc files on the hadoop filesystem. The original instructions at https://groups.google.com/forum/#!topic/common-crawl/hsb90GHq6to however were apparently for working with warcs stored on the local filesystem, as those instructions did not run the hadoop command but the regular java command. The regular java command did not work with the files being on the hadoop system (attempt #1 below).
98
99	ATTEMPTS THAT DIDN'T WORK:
100	1. vagrant@node1:~/ia-hadoop-tools$ java -jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -strictMode -skipExisting batch-id-xyz hdfs:///user/vagrant/cc-mri-subset/warc/*.warc.gz
101	2. vagrant@node1:~/ia-hadoop-tools$ $HADOOP_MAPRED_HOME/bin/hadoop jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -strictMode -skipExisting batch-id-xyz hdfs:///user/vagrant/cc-mri-subset/warc/*.warc.gz
102
103
104	The 2nd attempt, which uses a proper hadoop command, I based off https://www.tutorialspoint.com/map_reduce/implementation_in_hadoop.htm
105	It produced lots of errors and the output wet (and wat) .gz files were all corrupt as gunzip could not successfully run over them:
106
107	vagrant@node1:~/ia-hadoop-tools$ $HADOOP_MAPRED_HOME/bin/hadoop jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -strictMode -skipExisting batch-id-xyz hdfs:///user/vagrant/cc-mri-subset/warc/*.warc.gz
108	19/09/05 05:57:22 INFO Configuration.deprecation: mapred.task.timeout is deprecated. Instead, use mapreduce.task.timeout
109	19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902100139-000000.warc.gz
110	19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902100141-000001.warc.gz
111	19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902100451-000002.warc.gz
112	19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902100453-000003.warc.gz
113	19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902100805-000004.warc.gz
114	19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902100809-000005.warc.gz
115	19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902101119-000006.warc.gz
116	19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902101119-000007.warc.gz
117	19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902101429-000008.warc.gz
118	19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902101429-000009.warc.gz
119	19/09/05 05:57:23 INFO client.RMProxy: Connecting to ResourceManager at node1/127.0.0.1:8032
120	19/09/05 05:57:23 INFO client.RMProxy: Connecting to ResourceManager at node1/127.0.0.1:8032
121	19/09/05 05:57:23 INFO mapred.FileInputFormat: Total input paths to process : 10
122	19/09/05 05:57:24 INFO mapreduce.JobSubmitter: number of splits:10
123	19/09/05 05:57:24 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1567397114047_0001
124	19/09/05 05:57:24 INFO impl.YarnClientImpl: Submitted application application_1567397114047_0001
125	19/09/05 05:57:24 INFO mapreduce.Job: The url to track the job: http://node1:8088/proxy/application_1567397114047_0001/
126	19/09/05 05:57:24 INFO mapreduce.Job: Running job: job_1567397114047_0001
127	19/09/05 05:57:31 INFO mapreduce.Job: Job job_1567397114047_0001 running in uber mode : false
128	19/09/05 05:57:31 INFO mapreduce.Job: map 0% reduce 0%
129	19/09/05 05:57:44 INFO mapreduce.Job: map 10% reduce 0%
130	19/09/05 05:57:44 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000002_0, Status : FAILED
131	Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
132	Container killed by the ApplicationMaster.
133	Container killed on request. Exit code is 143
134	Container exited with a non-zero exit code 143
135
136	19/09/05 05:57:45 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000004_0, Status : FAILED
137	Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
138	Container killed by the ApplicationMaster.
139	Container killed on request. Exit code is 143
140	Container exited with a non-zero exit code 143
141
142	19/09/05 05:57:45 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000001_0, Status : FAILED
143	Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
144	19/09/05 05:57:45 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000005_0, Status : FAILED
145	Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
146	19/09/05 05:57:45 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000000_0, Status : FAILED
147	Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
148	19/09/05 05:57:45 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000003_0, Status : FAILED
149	Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
150	19/09/05 05:57:46 INFO mapreduce.Job: map 0% reduce 0%
151	19/09/05 05:57:54 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000007_0, Status : FAILED
152	Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
153	19/09/05 05:57:55 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000006_0, Status : FAILED
154	Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
155	19/09/05 05:57:57 INFO mapreduce.Job: map 10% reduce 0%
156	19/09/05 05:57:57 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000009_0, Status : FAILED
157	Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
158	Container killed by the ApplicationMaster.
159	Container killed on request. Exit code is 143
160	Container exited with a non-zero exit code 143
161
162	19/09/05 05:57:58 INFO mapreduce.Job: map 20% reduce 0%
163	19/09/05 05:57:58 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000008_0, Status : FAILED
164	Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
165	19/09/05 05:58:06 INFO mapreduce.Job: map 30% reduce 0%
166	19/09/05 05:58:08 INFO mapreduce.Job: map 60% reduce 0%
167	19/09/05 05:58:09 INFO mapreduce.Job: map 70% reduce 0%
168	19/09/05 05:58:10 INFO mapreduce.Job: map 80% reduce 0%
169	19/09/05 05:58:12 INFO mapreduce.Job: map 90% reduce 0%
170	19/09/05 05:58:13 INFO mapreduce.Job: map 100% reduce 0%
171	19/09/05 05:58:13 INFO mapreduce.Job: Job job_1567397114047_0001 completed successfully
172	19/09/05 05:58:13 INFO mapreduce.Job: Counters: 32
173	File System Counters
174	FILE: Number of bytes read=0
175	FILE: Number of bytes written=1239360
176	FILE: Number of read operations=0
177	FILE: Number of large read operations=0
178	FILE: Number of write operations=0
179	HDFS: Number of bytes read=1430
180	HDFS: Number of bytes written=0
181	HDFS: Number of read operations=30
182	HDFS: Number of large read operations=0
183	HDFS: Number of write operations=0
184	Job Counters
185	Failed map tasks=10
186	Launched map tasks=20
187	Other local map tasks=10
188	Data-local map tasks=10
189	Total time spent by all maps in occupied slots (ms)=208160
190	Total time spent by all reduces in occupied slots (ms)=0
191	Total time spent by all map tasks (ms)=208160
192	Total vcore-milliseconds taken by all map tasks=208160
193	Total megabyte-milliseconds taken by all map tasks=213155840
194	Map-Reduce Framework
195	Map input records=10
196	Map output records=0
197	Input split bytes=1430
198	Spilled Records=0
199	Failed Shuffles=0
200	Merged Map outputs=0
201	GC time elapsed (ms)=1461
202	CPU time spent (ms)=2490
203	Physical memory (bytes) snapshot=1564528640
204	Virtual memory (bytes) snapshot=19642507264
205	Total committed heap usage (bytes)=1126170624
206	File Input Format Counters
207	Bytes Read=0
208	File Output Format Counters
209	Bytes Written=0
210	vagrant@node1:~/ia-hadoop-tools$
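
Since the job reported "completed successfully" despite the failed map tasks, it helps to test the output files directly rather than trust the job status. A rough sketch of such a check, assuming the wet/ (or wat/) files have first been copied out of HDFS into a local folder (for instance with hdfs dfs -get); check_gz_dir is a hypothetical helper name:

```shell
# Sketch only: run gzip's integrity test over every .gz file in a local folder.
# Prints one line per file and returns non-zero if any file is corrupt.
check_gz_dir() (
    dir="$1"
    bad=0
    for f in "$dir"/*.gz; do
        [ -e "$f" ] || continue      # folder contains no .gz files
        if gzip -t "$f" 2>/dev/null; then
            echo "OK      $f"
        else
            echo "CORRUPT $f"
            bad=1
        fi
    done
    exit "$bad"
)
```

gzip -t only verifies that the compressed stream is intact; it says nothing about whether the WET content inside is meaningful, so inspecting one file by eye is still worthwhile.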
211
212
213	5. The error messages are all the same but not very informative
214	19/09/05 05:57:45 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000001_0, Status : FAILED
215	Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
216
217	All the references I could find on google indicated that the full version of the error message was that this method (com.google.common.io.ByteStreams.limit(...)) could not be located.
218	The page at http://mail-archives.apache.org/mod_mbox/spark-user/201510.mbox/%[email protected]%3E
219	revealed that guava.jar contains the com.google.common.io.ByteStreams class.
220
221
222	TO GET THE EXECUTABLE TO WORK:
223	I located guava.jar, found there were 2 identical ones on the filesystem but that neither was on the hadoop classpath yet, so I copied it into one of the Hadoop Classpath locations. Then I was able to successfully run the executable and produce meaningful WET files at last from the WARC input files:
224
225
226	vagrant@node1:~$ locate guava.jar
227	/usr/share/java/guava.jar
228	/usr/share/maven/lib/guava.jar
vagrant@node1:~$ jar -tvf /usr/share/maven/lib/guava.jar | less
vagrant@node1:~$ jar -tvf /usr/share/java/guava.jar | less
231	# both contained the ByteStreams class
232
233	vagrant@node1:~$ cd -
234	/home/vagrant/ia-hadoop-tools
235	vagrant@node1:~/ia-hadoop-tools$ find . -name "guava.jar"
236	# None in the git project
237
238	vagrant@node1:~/ia-hadoop-tools$ hadoop classpath
239	/usr/local/hadoop/etc/hadoop:/usr/local/hadoop/share/hadoop/common/lib/:/usr/local/hadoop/share/hadoop/common/:/usr/local/hadoop/share/hadoop/hdfs:/usr/local/hadoop/share/hadoop/hdfs/lib/:/usr/local/hadoop/share/hadoop/hdfs/:/usr/local/hadoop/share/hadoop/yarn/lib/:/usr/local/hadoop/share/hadoop/yarn/:/usr/local/hadoop/share/hadoop/mapreduce/lib/:/usr/local/hadoop/share/hadoop/mapreduce/:/contrib/capacity-scheduler/*.jar
240	# guava.jar not on hadoop classpath yet
241
242	vagrant@node1:~/ia-hadoop-tools$ diff /usr/share/java/guava.jar /usr/share/maven/lib/guava.jar
243	# no differences, identical
244
245	vagrant@node1:~/ia-hadoop-tools$ hdfs dfs -put /usr/share/java/guava.jar /usr/local/hadoop/share/hadoop/common/.
246	put: `/usr/local/hadoop/share/hadoop/common/.': No such file or directory
247	# hadoop classpath locations are not on the hdfs filesystem, but on the regular fs
248
249	vagrant@node1:~/ia-hadoop-tools$ sudo cp /usr/share/java/guava.jar /usr/local/hadoop/share/hadoop/common/.
250	vagrant@node1:~/ia-hadoop-tools$ hadoop classpath
251	/usr/local/hadoop/etc/hadoop:/usr/local/hadoop/share/hadoop/common/lib/:/usr/local/hadoop/share/hadoop/common/:/usr/local/hadoop/share/hadoop/hdfs:/usr/local/hadoop/share/hadoop/hdfs/lib/:/usr/local/hadoop/share/hadoop/hdfs/:/usr/local/hadoop/share/hadoop/yarn/lib/:/usr/local/hadoop/share/hadoop/yarn/:/usr/local/hadoop/share/hadoop/mapreduce/lib/:/usr/local/hadoop/share/hadoop/mapreduce/:/contrib/capacity-scheduler/*.jar
252	# Copied guava.jar to somewhere on existing hadoop classpath
253
254	vagrant@node1:~/ia-hadoop-tools$ $HADOOP_MAPRED_HOME/bin/hadoop jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -strictMode -skipExisting batch-id-xyz hdfs:///user/vagrant/cc-mri-subset/warc/*.warc.gz
255	# Successful run
256
vagrant@node1:~/ia-hadoop-tools$ hdfs dfs -get /user/vagrant/cc-mri-subset/wet/MAORI-CC-2019-30-20190902100139-000000.warc.wet.gz /home/vagrant/.
vagrant@node1:~/ia-hadoop-tools$ cd ..
259	vagrant@node1:~$ gunzip MAORI-CC-2019-30-20190902100139-000000.warc.wet.gz
260	vagrant@node1:~$ less MAORI-CC-2019-30-20190902100139-000000.warc.wet
261	# Copied a WET output file from the hadoop filesystem to local filesystem and inspected its contents. Works!
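
The by-eye check above of whether guava.jar sits anywhere on the hadoop classpath could also be scripted. A minimal sketch, assuming the classpath is passed in as a colon-separated string (such as the output of hadoop classpath) and that entries are folders, optionally ending in / or a * wildcard; find_jar_on_classpath is a hypothetical name:

```shell
# Sketch only: print each classpath folder that contains the given jar file.
# Returns non-zero if the jar is found in none of them.
find_jar_on_classpath() (
    classpath="$1"
    jar="$2"
    found=1
    IFS=':'
    set -f                          # no filename globbing while splitting on ':'
    for entry in $classpath; do
        dir="${entry%\*}"           # drop a trailing wildcard, if any
        dir="${dir%/}"              # drop a trailing slash, if any
        if [ -f "$dir/$jar" ]; then
            echo "$dir"
            found=0
        fi
    done
    [ "$found" -eq 0 ]
)
```

For example, find_jar_on_classpath "$(hadoop classpath)" guava.jar would be expected to print nothing before the sudo cp step and /usr/local/hadoop/share/hadoop/common after it.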
262
263	-----------------------------------
264	VIEW THE MRI-ONLY INDEX GENERATED
265	-----------------------------------
hdfs dfs -cat hdfs:///user/vagrant/cc-mri-csv/part* | tail -5
267
268	(gz archive, binary file)
269
270	vagrant@node1:~/cc-index-table/src/script$ hdfs dfs -mkdir hdfs:///user/vagrant/cc-mri-unzipped-csv
271
272	# https://stackoverflow.com/questions/34573279/how-to-unzip-gz-files-in-a-new-directory-in-hadoop
XXX vagrant@node1:~/cc-index-table/src/script$ hadoop fs -cat hdfs:///user/vagrant/cc-mri-csv/part* | gzip -d | hadoop fs -put - hdfs:///user/vagrant/cc-mri-unzipped-csv

vagrant@node1:~/cc-index-table/src/script$ hdfs dfs -cat hdfs:///user/vagrant/cc-mri-csv/part* | gzip -d | hdfs dfs -put - hdfs:///user/vagrant/cc-mri-unzipped-csv/cc-mri.csv
277	vagrant@node1:~/cc-index-table/src/script$ hdfs dfs -ls hdfs:///user/vagrant/cc-mri-unzipped-csv
278	Found 1 items
279	-rw-r--r-- 1 vagrant supergroup 71664603 2019-08-29 04:47 hdfs:///user/vagrant/cc-mri-unzipped-csv/cc-mri.csv
280
281	# https://stackoverflow.com/questions/14925323/view-contents-of-file-in-hdfs-hadoop
vagrant@node1:~/cc-index-table/src/script$ hdfs dfs -cat hdfs:///user/vagrant/cc-mri-unzipped-csv/cc-mri.csv | tail -5
283
284	# url, warc_filename, warc_record_offset, warc_record_length
285	http://paupauocean.com/page91?product_id=142&brd=1,crawl-data/CC-MAIN-2019-30/segments/1563195526940.0/warc/CC-MAIN-20190721082354-20190721104354-00088.warc.gz,115081770,21404
286	https://cookinseln-reisen.de/cook-inseln/rarotonga/,crawl-data/CC-MAIN-2019-30/segments/1563195526799.4/warc/CC-MAIN-20190720235054-20190721021054-00289.warc.gz,343512295,12444
287	http://www.halopharm.com/mi/profile/,crawl-data/CC-MAIN-2019-30/segments/1563195525500.21/warc/CC-MAIN-20190718042531-20190718064531-00093.warc.gz,219160333,10311
288	https://www.firstpeople.us/pictures/green/Touched-by-the-hand-of-Time-1907.html,crawl-data/CC-MAIN-2019-30/segments/1563195526670.1/warc/CC-MAIN-20190720194009-20190720220009-00362.warc.gz,696195242,5408
289	https://www.sos-accessoire.com/programmateur-programmateur-module-electronique-whirlpool-481231028062-27573.html,crawl-data/CC-MAIN-2019-30/segments/1563195527048.80/warc/CC-MAIN-20190721144008-20190721170008-00164.warc.gz,830087190,26321
290
291	# https://stackoverflow.com/questions/32612867/how-to-count-lines-in-a-file-on-hdfs-command
vagrant@node1:~/cc-index-table/src/script$ hdfs dfs -cat hdfs:///user/vagrant/cc-mri-unzipped-csv/cc-mri.csv | wc -l
293	345625
294
295
296	ANOTHER WAY (DR BAINBRIDGE'S WAY) TO CREATE SINGLE .CSV FILE FROM /part* FILES AND VIEW ITS CONTENTS:
297	vagrant@node1:~/cc-index-table$ hdfs dfs -cat hdfs:///user/vagrant/cc-mri-csv/part* > file.csv.gz
298	vagrant@node1:~/cc-index-table$ less file.csv.gz
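
Both ways rely on the fact that gzip files concatenated together still form a valid gzip stream, so the part* files can be joined first and unzipped afterwards. A small sketch of the row count done on a local copy of the part files (assuming they were fetched into a folder, for instance with hdfs dfs -get); count_csv_rows is a hypothetical name:

```shell
# Sketch only: count the csv rows across all gzipped part* files in a local folder.
count_csv_rows() (
    dir="$1"
    # concatenate the gzip members, decompress them as one stream, count the lines
    cat "$dir"/part* | gzip -dc | wc -l | tr -d ' '
)
```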
299
300
301	https://www.patricia-anong.com/blog/2017/11/1/extend-vmdk-on-virtualbox
302
303
304	When not using LIKE '%mri%' but = 'mri' instead:
vagrant@node1:~/cc-index-table$ hdfs dfs -cat hdfs:///user/vagrant/cc-mri-unzipped-csv/cc-mri.csv | wc -l
306	5767
307
308
309	For a month later, the August 2019 crawl:
vagrant@node1:~$ hdfs dfs -cat hdfs:///user/vagrant/CC-MAIN-2019-35/cc-mri-unzipped-csv/cc-mri.csv | wc -l
311	9318
312
313	-----------------------------------------
314	Running export_mri_subset.sh
315	-----------------------------------------
316
The export_mri_subset.sh script is set up to run on the csv input file produced by running export_mri_index_csv.sh
318
319	Running this initially produced the following exception:
320
321
322	2019-08-29 05:48:52 INFO CCIndexExport:152 - Number of records/rows matched by query: 345624
323	2019-08-29 05:48:52 INFO CCIndexExport:157 - Distributing 345624 records to 70 output partitions (max. 5000 records per WARC file)
324	2019-08-29 05:48:52 INFO CCIndexExport:165 - Repartitioning data to 70 output partitions
325	Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`url`' given input columns: [http://176.31.110.213:600/?p=287, crawl-data/CC-MAIN-2019-30/segments/1563195527531.84/warc/CC-MAIN-20190722051628-20190722073628-00547.warc.gz, 1215489, 15675];;
326	'Project ['url, 'warc_filename, 'warc_record_offset, 'warc_record_length]
327	+- AnalysisBarrier
328	+- Repartition 70, true
329	+- Relation[http://176.31.110.213:600/?p=287#10,crawl-data/CC-MAIN-2019-30/segments/1563195527531.84/warc/CC-MAIN-20190722051628-20190722073628-00547.warc.gz#11,1215489#12,15675#13] csv
330
331	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
332	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:88)
333	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85)
334	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
335	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
336	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
337	at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
338	at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95)
339	at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95)
340	at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:106)
341	at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:116)
342	at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:120)
343	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
344	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
345	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
346	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
347	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
348	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
349	at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:120)
350	at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:125)
351	at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
352	at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:125)
353	at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:95)
354	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:85)
355	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80)
356	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
357	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80)
358	at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
359	at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:104)
360	at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
361	at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
362	at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
363	at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74)
364	at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:3295)
365	at org.apache.spark.sql.Dataset.select(Dataset.scala:1307)
366	at org.apache.spark.sql.Dataset.select(Dataset.scala:1325)
367	at org.apache.spark.sql.Dataset.select(Dataset.scala:1325)
368	at org.commoncrawl.spark.examples.CCIndexWarcExport.run(CCIndexWarcExport.java:169)
369	at org.commoncrawl.spark.examples.CCIndexExport.run(CCIndexExport.java:192)
370	at org.commoncrawl.spark.examples.CCIndexWarcExport.main(CCIndexWarcExport.java:214)
371	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
372	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
373	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
374	at java.lang.reflect.Method.invoke(Method.java:498)
375	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
376	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
377	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
378	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
379	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
380	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
381	2019-08-29 05:48:52 INFO SparkContext:54 - Invoking stop() from shutdown hook
382
383
384
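Hints to solve it were at https://stackoverflow.com/questions/45972929/scala-dataframereader-keep-column-headers
The "input columns" listed in the exception are in fact the values of a data row of the csv, which spark took to be the header row. The actual solution is to edit CCIndexWarcExport.java as follows:
1. Set option("header") to false, since the csv file contains no header row, only data rows. You can confirm the csv has no header row by doing the following (the part files are gzipped, hence the gzip -d):
hdfs dfs -cat hdfs:///user/vagrant/cc-mri-csv/part* | gzip -d | head -5

2. With no header row, the 4 column names are inferred as _c0 to _c3, not as url/warc_filename etc., so select the columns by those names instead.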
391
392	emacs src/main/java/org/commoncrawl/spark/examples/CCIndexWarcExport.java
393
394	Change:
395	sqlDF = sparkSession.read().format("csv").option("header", true).option("inferSchema", true)
396	.load(csvQueryResult);
397	To
398	sqlDF = sparkSession.read().format("csv").option("header", false).option("inferSchema", true)
399	.load(csvQueryResult);
400
401	And comment out:
//JavaRDD<Row> rdd = sqlDF.select("url", "warc_filename", "warc_record_offset", "warc_record_length").rdd()
//.toJavaRDD();
404	Replace with the default inferred column names:
405	JavaRDD<Row> rdd = sqlDF.select("_c0", "_c1", "_c2", "_c3").rdd()
406	.toJavaRDD();
407
408
409	Now recompile:
410	mvn package
411
412	And run:
413	./src/script/export_mri_subset.sh
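
The header check from step 1 can also be done mechanically. A minimal sketch, assuming an unzipped local copy of the csv and assuming that a header row, if there were one, would consist of exactly the four column names used in the query; csv_has_header is a hypothetical name:

```shell
# Sketch only: succeed if the first line of the csv is the expected header row.
csv_has_header() (
    file="$1"
    expected="url,warc_filename,warc_record_offset,warc_record_length"
    first=$(head -n 1 "$file")
    [ "$first" = "$expected" ]
)
```

If the check fails, as it would for the csv here, then reading with header set to false and selecting _c0 to _c3 is the matching configuration.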
414
415	-------------------------
416
417	WET example from https://github.com/commoncrawl/cc-warc-examples
418
419	vagrant@node1:~/cc-warc-examples$ hdfs dfs -mkdir /user/vagrant/data
420	vagrant@node1:~/cc-warc-examples$ hdfs dfs -put data/CC-MAIN-20190715175205-20190715200159-00000.warc.wet.gz hdfs:///user/vagrant/data/.
421	vagrant@node1:~/cc-warc-examples$ hdfs dfs -ls data
422	Found 1 items
423	-rw-r--r-- 1 vagrant supergroup 154823265 2019-08-19 08:23 data/CC-MAIN-20190715175205-20190715200159-00000.warc.wet.gz
424	vagrant@node1:~/cc-warc-examples$ hadoop jar target/cc-warc-examples-0.3-SNAPSHOT-jar-with-dependencies.jar org.commoncrawl.examples.mapreduce.WETWordCount
425
426	<ONCE FINISHED:>
427
428	vagrant@node1:~/cc-warc-examples$ hdfs dfs -cat /tmp/cc/part*
429
430
431
432	INFO ON HADOOP/HDFS:
433	https://www.bluedata.com/blog/2016/08/hadoop-hdfs-upgrades-painful/
434
435	SPARK:
436	configure option example: https://stackoverflow.com/questions/46152526/how-should-i-configure-spark-to-correctly-prune-hive-metastore-partitions
437
438
439
440	LIKE '%isl%'
441
cd cc-index-table
APPJAR=target/cc-spark-0.2-SNAPSHOT-jar-with-dependencies.jar
# $SPARK_ON_YARN is left out of the command below (a commented-out line in the middle of a backslash-continued command would cut the command short)
$SPARK_HOME/bin/spark-submit \
--conf spark.hadoop.parquet.enable.dictionary=true \
--conf spark.hadoop.parquet.enable.summary-metadata=false \
--conf spark.sql.hive.metastorePartitionPruning=true \
--conf spark.sql.parquet.filterPushdown=true \
--conf spark.sql.parquet.mergeSchema=true \
--class org.commoncrawl.spark.examples.CCIndexWarcExport $APPJAR \
--query "SELECT url, warc_filename, warc_record_offset, warc_record_length
FROM ccindex
WHERE crawl = 'CC-MAIN-2019-30' AND subset = 'warc' AND content_languages LIKE '%mri%'" \
--numOutputPartitions 12 \
--numRecordsPerWarcFile 20000 \
--warcPrefix ICELANDIC-CC-2018-43 \
s3://commoncrawl/cc-index/table/cc-main/warc/ \
.../my_output_path/
460
461
462	----
463	TIME
464	----
465	1. https://dzone.com/articles/need-billions-of-web-pages-dont-bother-crawling
466	http://digitalpebble.blogspot.com/2017/03/need-billions-of-web-pages-dont-bother_29.html
467
"So, not only have CommonCrawl given you loads of web data for free, they've also made your life easier by preprocessing the data for you. For many tasks, the content of the WAT or WET files will be sufficient and you won't have to process the WARC files.
469
470	This should not only help you simplify your code but also make the whole processing faster. We recently ran an experiment on CommonCrawl where we needed to extract anchor text from HTML pages. We initially wrote some MapReduce code to extract the binary content of the pages from their WARC representation, processed the HTML with JSoup and reduced on the anchor text. Processing a single WARC segment took roughly 100 minutes on a 10-node EMR cluster. We then simplified the extraction logic, took the WAT files as input and the processing time dropped to 17 minutes on the same cluster. This gain was partly due to not having to parse the web pages, but also to the fact that WAT files are a lot smaller than their WARC counterparts."
471
472	2. https://spark-in.me/post/parsing-common-crawl-in-two-simple-commands
"Both the parsing part and the processing part take just a couple of minutes per index file / WET file - the bulk of the 'compute' lies within actually downloading these files.
474
475	Essentially if you have some time to spare and an unlimited Internet connection, all of this processing can be done on one powerful machine. You can be fancy and go ahead and rent some Amazon server(s) to minimize the download time, but that can be costly.
476
477	In my experience - parsing the whole index for Russian websites (just filtering by language) takes approximately 140 hours - but the majority of this time is just downloading (my speed averaged ~300-500 kb/s)."
478
479	----
480	CMDS
481	----
482	https://stackoverflow.com/questions/29565716/spark-kill-running-application
483
484	=========================================================
485	Configuring spark to work on Amazon AWS s3a dataset:
486	=========================================================
487	https://stackoverflow.com/questions/30385981/how-to-access-s3a-files-from-apache-spark
488	http://deploymentzone.com/2015/12/20/s3a-on-spark-on-aws-ec2/
489	https://answers.dataiku.com/1734/common-crawl-s3
490	https://stackoverflow.com/questions/2354525/what-should-be-hadoop-tmp-dir
491	https://stackoverflow.com/questions/40169610/where-exactly-should-hadoop-tmp-dir-be-set-core-site-xml-or-hdfs-site-xml
492
493	https://stackoverflow.com/questions/43759896/spark-truncated-the-string-representation-of-a-plan-since-it-was-too-large-w
494
495
496	https://sparkour.urizone.net/recipes/using-s3/
497	Configuring Spark to Use Amazon S3
498	"Some Spark tutorials show AWS access keys hardcoded into the file paths. This is a horribly insecure approach and should never be done. Use exported environment variables or IAM Roles instead, as described in Configuring Amazon S3 as a Spark Data Source."
499
500	"No FileSystem for scheme: s3n
501
502	java.io.IOException: No FileSystem for scheme: s3n
503
504	This message appears when dependencies are missing from your Apache Spark distribution. If you see this error message, you can use the --packages parameter and Spark will use Maven to locate the missing dependencies and distribute them to the cluster. Alternately, you can use --jars if you manually downloaded the dependencies already. These parameters also works on the spark-submit script."
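
As a rough sketch of the two options in the quote above, the following builds both forms of the spark-submit invocation and prints them rather than running them. The Maven coordinate org.apache.hadoop:hadoop-aws:2.7.6 and the /home/vagrant jar path are assumptions chosen to match the Hadoop 2.7.6 install used in these notes; the --class, --conf and input/output arguments are left out and would be the same as in the full spark-submit command recorded in these notes.

```shell
# Assumed version and locations; adjust to the local install.
HADOOP_AWS_VERSION=2.7.6
APPJAR=target/cc-spark-0.2-SNAPSHOT-jar-with-dependencies.jar

# Option 1: let Spark resolve the dependency (and its transitive ones) via Maven.
PACKAGES_CMD="spark-submit --packages org.apache.hadoop:hadoop-aws:${HADOOP_AWS_VERSION} ${APPJAR}"

# Option 2: point Spark at jars that were downloaded by hand (comma-separated list).
JARS_CMD="spark-submit --jars /home/vagrant/hadoop-aws-${HADOOP_AWS_VERSION}.jar ${APPJAR}"

# Print the commands for review instead of executing them.
echo "$PACKAGES_CMD"
echo "$JARS_CMD"
```

With --packages, the machine running spark-submit needs network access to a Maven repository; --jars avoids that at the cost of tracking the jar versions by hand.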
505
506	===========================================
507	IAM Role (or user) and commoncrawl profile
508	===========================================
509
510	"iam" role or user for commoncrawl(er) profile
511
512
513	aws management console:
514	[email protected]
515	lab pwd, capital R and ! (maybe g)
516
517	commoncrawl profile created while creating the user/role, by following the instructions at: https://answers.dataiku.com/1734/common-crawl-s3
518
519	https://answers.dataiku.com/1734/common-crawl-s3
520	Even though the bucket is public, if your AWS key does not have your full permissions (ie if it's a restricted IAM user), you need to grant explicit access to the commoncrawl bucket: attach the following policy to your IAM user:
521	#### START JSON (POLICY) ###
522	{
523	"Version": "2012-10-17",
524	"Statement": [
525	{
526	"Sid": "Stmt1503647467000",
527	"Effect": "Allow",
528	"Action": [
529	"s3:GetObject",
530	"s3:ListBucket"
531	],
532	"Resource": [
533	"arn:aws:s3:::commoncrawl/*",
534	"arn:aws:s3:::commoncrawl"
535	]
536	}
537	]
538	}
539	#### END ###
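
The policy above can also be attached from the command line instead of the management console. This is only a sketch: the file location, the IAM_USER variable and the policy name commoncrawl-read are hypothetical, and the aws call is skipped unless IAM_USER is set and the AWS CLI is installed.

```shell
# Hypothetical location for the policy document.
POLICY_FILE="${TMPDIR:-/tmp}/commoncrawl-policy.json"

# Same policy as above, written out as a file.
cat > "$POLICY_FILE" <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1503647467000",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": ["arn:aws:s3:::commoncrawl/*", "arn:aws:s3:::commoncrawl"]
    }
  ]
}
EOF

# Optional sanity check that the file is well-formed JSON.
if command -v python3 >/dev/null 2>&1; then
  python3 -m json.tool "$POLICY_FILE" >/dev/null && echo "policy JSON is valid"
fi

# Attach it as an inline policy. IAM_USER is a placeholder (assumption);
# nothing is sent to AWS unless it is set and the aws CLI is available.
if [ -n "${IAM_USER:-}" ] && command -v aws >/dev/null 2>&1; then
  aws iam put-user-policy \
    --user-name "$IAM_USER" \
    --policy-name commoncrawl-read \
    --policy-document "file://$POLICY_FILE"
fi
```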
540
541	<!--
542	<property>
543	<name>fs.s3a.awsAccessKeyId</name>
544	<value>XXX</value>
545	</property>
546	<property>
547	<name>fs.s3a.awsSecretAccessKey</name>
548	<value>XXX</value>
549	</property>
550	-->
551
552
553	[When the access key and secret key were specified in hadoop's core-site.xml (the commented-out properties above) and not in the spark conf properties file, running export_maori_index_csv.sh produced the following error:
554
555	2019-08-29 06:16:38 INFO StateStoreCoordinatorRef:54 - Registered StateStoreCoordinator endpoint
556	2019-08-29 06:16:40 WARN FileStreamSink:66 - Error while looking for metadata directory.
557	Exception in thread "main" com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain
558	at com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
559	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3521)
560	at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
561	at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
562	at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
563	]
564
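565	The access and secret keys did not work from hadoop's core-site.xml as above (edited with sudo emacs /usr/local/hadoop-2.7.6/etc/hadoop/core-site.xml). One possible cause: fs.s3a.awsAccessKeyId and fs.s3a.awsSecretAccessKey follow the older s3n naming, whereas the s3a connector reads fs.s3a.access.key and fs.s3a.secret.key.
566
567	Instead, put the Amazon AWS access key and secret key in the spark properties file: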
568
569	sudo emacs /usr/local/spark-2.3.0-bin-hadoop2.7/conf/spark-defaults.conf
570
571
572	The spark properties conf file above should contain:
573
574	spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
575	spark.hadoop.fs.s3a.access.key=PASTE_IAM-ROLE_ACCESSKEY
576	spark.hadoop.fs.s3a.secret.key=PASTE_IAM-ROLE_SECRETKEY
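
The three properties above could also be appended from the shell, taking the keys from environment variables so that they are not typed into the editor. A minimal sketch: SPARK_CONF defaults to a scratch file here (an assumption, so the snippet is harmless to run); on the VM it would be the spark-defaults.conf path given above and the command would need sudo.

```shell
# Scratch file by default (assumption); point SPARK_CONF at the real
# spark-defaults.conf on the VM to apply the settings there.
SPARK_CONF="${SPARK_CONF:-$(mktemp)}"

# Placeholders so the sketch runs; export the real values in the shell instead.
: "${AWS_ACCESS_KEY_ID:=PASTE_IAM-ROLE_ACCESSKEY}"
: "${AWS_SECRET_ACCESS_KEY:=PASTE_IAM-ROLE_SECRETKEY}"

# Append the s3a settings; the spark.hadoop. prefix passes them on to Hadoop.
cat >> "$SPARK_CONF" <<EOF
spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key=${AWS_ACCESS_KEY_ID}
spark.hadoop.fs.s3a.secret.key=${AWS_SECRET_ACCESS_KEY}
EOF

# Show which s3a properties the file now holds, without printing the values.
grep '^spark\.hadoop\.fs\.s3a\.' "$SPARK_CONF" | cut -d= -f1
```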
577
578
579
580	While the job is running, the Spark web UI for the application can be visited at http://node1:4040/jobs/ (the first time it was at http://node1:4041/jobs/ for me, which I put down to having forwarded the vagrant VM's ports at +1; on subsequent runs it seemed to be at node1:4040/jobs).
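
A quick way to see which port the UI is on is to probe both candidates. This is a sketch only; the helper name spark_ui_url is made up for illustration. Spark itself also moves to the next free port (4041, 4042, ...) when 4040 is already held by another running application, which may be another reason for having seen 4041.

```shell
# Build the jobs-page URL for a given host and port (hypothetical helper).
spark_ui_url() {
  echo "http://$1:$2/jobs/"
}

# Probe the default UI port and the next one; report whichever answers.
for port in 4040 4041; do
  url=$(spark_ui_url node1 "$port")
  if curl -s -o /dev/null --max-time 2 "$url"; then
    echo "Spark UI answering at $url"
  fi
done
```

Run it from inside the VM, or replace node1 with localhost and the forwarded port when checking from the host.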
581
582	-------------
583
584	APPJAR=target/cc-spark-0.2-SNAPSHOT-jar-with-dependencies.jar
585	$SPARK_HOME/bin/spark-submit \
586	--conf spark.hadoop.parquet.enable.dictionary=true \
587	--conf spark.hadoop.parquet.enable.summary-metadata=false \
588	--conf spark.sql.hive.metastorePartitionPruning=true \
589	--conf spark.sql.parquet.filterPushdown=true \
590	--conf spark.sql.parquet.mergeSchema=true \
591	--class org.commoncrawl.spark.examples.CCIndexExport $APPJAR \
592	--query "SELECT url, warc_filename, warc_record_offset, warc_record_length
593	FROM ccindex
594	WHERE crawl = 'CC-MAIN-2019-30' AND subset = 'warc' AND content_languages LIKE '%mri%'" \
595	--outputFormat csv \
596	--numOutputPartitions 10 \
597	--outputCompression gzip \
598	s3a://commoncrawl/cc-index/table/cc-main/warc/ \
599	hdfs:///user/vagrant/cc-mri-csv
600
601	----------------
602	Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
603
604
605	https://stackoverflow.com/questions/39355354/spark-no-filesystem-for-scheme-https-cannot-load-files-from-amazon-s3
606	https://stackoverflow.com/questions/33356041/technically-what-is-the-difference-between-s3n-s3a-and-s3
607	"2018-01-10 Update Hadoop 3.0 has cut its s3: and s3n implementations: s3a is all you get. It is now significantly better than its predecessor and performs as least as good as the Amazon implementation. Amazon's "s3:" is still offered by EMR, which is their closed source client. Consult the EMR docs for more info."
608
609	1. https://stackoverflow.com/questions/30385981/how-to-access-s3a-files-from-apache-spark
610
611	"Having experienced first hand the difference between s3a and s3n - 7.9GB of data transferred on s3a was around ~7 minutes while 7.9GB of data on s3n took 73 minutes [us-east-1 to us-west-1 unfortunately in both cases; Redshift and Lambda being us-east-1 at this time] this is a very important piece of the stack to get correct and it's worth the frustration.
612
613	Here are the key parts, as of December 2015:
614
615	Your Spark cluster will need a Hadoop version 2.x or greater. If you use the Spark EC2 setup scripts and maybe missed it, the switch for using something other than 1.0 is to specify --hadoop-major-version 2 (which uses CDH 4.2 as of this writing).
616
617	You'll need to include what may at first seem to be an out of date AWS SDK library (built in 2014 as version 1.7.4) for versions of Hadoop as late as 2.7.1 (stable): aws-java-sdk 1.7.4. As far as I can tell using this along with the specific AWS SDK JARs for 1.10.8 hasn't broken anything.
618
619	You'll also need the hadoop-aws 2.7.1 JAR on the classpath. This JAR contains the class org.apache.hadoop.fs.s3a.S3AFileSystem.
620
621	In spark.properties you probably want some settings that look like this:
622
623	spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
624	spark.hadoop.fs.s3a.access.key=ACCESSKEY
625	spark.hadoop.fs.s3a.secret.key=SECRETKEY
626
627	I've detailed this list in more detail on a post I wrote as I worked my way through this process. In addition I've covered all the exception cases I hit along the way and what I believe to be the cause of each and how to fix them."
628
629
630	2. The classpath used by hadoop can be found by running the command (https://stackoverflow.com/questions/26748811/setting-external-jars-to-hadoop-classpath):
631	hadoop classpath
632
633
634	3. Got hadoop-aws 2.7.6 jar
635	from https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/2.7.6
636	and put it into /home/vagrant
637
638
639	4. https://stackoverflow.com/questions/26748811/setting-external-jars-to-hadoop-classpath
640	https://stackoverflow.com/questions/28520821/how-to-add-external-jar-to-hadoop-job/54459211#54459211
641	vagrant@node1:~$ export LIBJARS=/home/vagrant/hadoop-aws-2.7.6.jar
642	vagrant@node1:~$ export HADOOP_CLASSPATH=`echo ${LIBJARS} | sed s/,/:/g`
643	vagrant@node1:~$ hadoop classpath
644
645	5. https://community.cloudera.com/t5/Community-Articles/HDP-2-4-0-and-Spark-1-6-0-connecting-to-AWS-S3-buckets/ta-p/245760
646	"Download the aws sdk for java https://aws.amazon.com/sdk-for-java/ Uploaded it to the hadoop directory. You should see the aws-java-sdk-1.10.65.jar in /usr/hdp/2.4.0.0-169/hadoop/"
647
648	I got version 1.11
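
The sed call in step 4 only matters once LIBJARS holds more than one jar: LIBJARS is comma-separated, while HADOOP_CLASSPATH expects colon-separated entries. A small sketch, assuming (hypothetically) that the AWS SDK jar was also saved to /home/vagrant:

```shell
# Comma-separated jar list; the second path is an assumed download location.
LIBJARS=/home/vagrant/hadoop-aws-2.7.6.jar,/home/vagrant/aws-java-sdk-1.11.616.jar

# Convert commas to colons so the list can be used as a classpath.
HADOOP_CLASSPATH=$(echo "${LIBJARS}" | sed 's/,/:/g')
export LIBJARS HADOOP_CLASSPATH

echo "$HADOOP_CLASSPATH"
# prints /home/vagrant/hadoop-aws-2.7.6.jar:/home/vagrant/aws-java-sdk-1.11.616.jar
```

Running hadoop classpath afterwards should list these entries along with the defaults.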
649
650	[Can't find a spark.properties file, but this seems to contain spark specific properties:
651	$SPARK_HOME/conf/spark-defaults.conf
652
653	https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-properties.html
654	"The default Spark properties file is $SPARK_HOME/conf/spark-defaults.conf that could be overriden using spark-submit with the --properties-file command-line option."]
655
656	Use sudo to copy the 2 jar files hadoop-aws-2.7.6.jar and aws-java-sdk-1.11.616.jar to:
657	/usr/local/hadoop/share/hadoop/common/
658	(else /usr/local/hadoop/share/hadoop/hdfs/hadoop-aws-2.7.6.jar)
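
A sketch of the copy step with a check that both jars ended up where Hadoop looks for them. The function name check_jars is made up for illustration and the /home/vagrant source paths are assumptions. Note that the Stack Overflow answer quoted earlier pairs hadoop-aws 2.7.x with aws-java-sdk 1.7.4, so the 1.11 SDK jar may be worth revisiting if class or method errors show up.

```shell
# On the VM the copy itself needs root (source paths are assumptions):
#   sudo cp /home/vagrant/hadoop-aws-2.7.6.jar \
#           /home/vagrant/aws-java-sdk-1.11.616.jar \
#           /usr/local/hadoop/share/hadoop/common/

# check_jars DIR: report which of the two jars are present in DIR.
# The exit status is the number of missing jars (0 means both are there).
check_jars() {
  dir=$1
  missing=0
  for jar in hadoop-aws-2.7.6.jar aws-java-sdk-1.11.616.jar; do
    if [ -f "$dir/$jar" ]; then
      echo "found   $jar"
    else
      echo "missing $jar"
      missing=$((missing + 1))
    fi
  done
  return "$missing"
}

check_jars /usr/local/hadoop/share/hadoop/common || echo "copy the missing jars with sudo cp"
```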
659
660	--------
661	schema
662	https://commoncrawl.s3.amazonaws.com/cc-index/table/cc-main/index.html
663
664	---------------
665	More examples to try:
666	https://github.com/commoncrawl/cc-warc-examples
667
668
669	A bit outdated?
670	https://www.journaldev.com/20342/apache-spark-example-word-count-program-java
671	https://www.journaldev.com/20261/apache-spark
672
673	--------
674
675	sudo apt-get install maven
676	(or sudo apt update
677	sudo apt install maven)
678	git clone https://github.com/commoncrawl/cc-index-table.git
679	cd cc-index-table
680	mvn package
681	vagrant@node1:~/cc-index-table$ ./src/script/convert_url_index.sh https://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2019-30/indexes/cdx-00000.gz hdfs:///user/vagrant/cc-index-table
682
683
684
685
686	spark:
687	https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-shell.html
688
689	============
690	Dr Bainbridge found the following vagrant file that will set up hadoop and spark presumably for cluster computing:
691
692	https://github.com/martinprobson/vagrant-hadoop-hive-spark
693
694	Vagrant:
695	* Guide: https://www.vagrantup.com/intro/getting-started/index.html
696	* Common cmds: https://blog.ipswitch.com/5-vagrant-commands-you-need-to-know
697	* vagrant reload = vagrant halt + vagrant up https://www.vagrantup.com/docs/cli/reload.html
698	* https://stackoverflow.com/questions/46903623/how-to-use-firefox-ui-in-vagrant-box
699	* https://stackoverflow.com/questions/22651399/how-to-install-firefox-in-precise64-vagrant-box
700	sudo apt-get -y install firefox
701	* vagrant install emacs: https://medium.com/@AnnaJS15/getting-started-with-virtualbox-and-vagrant-8d98aa271d2a
702
703	* hadoop conf: sudo vi /usr/local/hadoop-2.7.6/etc/hadoop/mapred-site.xml
704	* https://data-flair.training/forums/topic/mkdir-cannot-create-directory-data-name-node-is-in-safe-mode/
705	---
706	==> node1: Forwarding ports...
707	node1: 8080 (guest) => 8081 (host) (adapter 1)
708	node1: 8088 (guest) => 8089 (host) (adapter 1)
709	node1: 9083 (guest) => 9084 (host) (adapter 1)
710	node1: 4040 (guest) => 4041 (host) (adapter 1)
711	node1: 18888 (guest) => 18889 (host) (adapter 1)
712	node1: 16010 (guest) => 16011 (host) (adapter 1)
713	node1: 22 (guest) => 2200 (host) (adapter 1)
714	==> node1: Running 'pre-boot' VM customizations...
715
716
717	==> node1: Checking for guest additions in VM...
718	node1: The guest additions on this VM do not match the installed version of
719	node1: VirtualBox! In most cases this is fine, but in rare cases it can
720	node1: prevent things such as shared folders from working properly. If you see
721	node1: shared folder errors, please make sure the guest additions within the
722	node1: virtual machine match the version of VirtualBox you have installed on
723	node1: your host and reload your VM.
724	node1:
725	node1: Guest Additions Version: 5.1.38
726	node1: VirtualBox Version: 5.2
727
728	------------
