Context Navigation

Vagrant-Spark-Hadoop.txt@ 33467

Last change on this file since 33467 was 33467, checked in by ak19, 5 years ago

Improved the code to use a static block to load the needed properties from config.properties and initialise some static final ints from there. Code now uses the logger for debugging. New properties in config.properties. Returned code to use a counter, recordCount, re-zeroed for each WETProcessor since the count was used for unique filenames, and filename prefixes are unique for each warc.wet file. So these prefixes, in combination with keeping track of the recordcount per warc.wet file, each WET record written out to a file is assigned a unique filename. (No longer need a running total of all WET records across warc.wet files processed ensuring uniqueness of filenames.) All appears to still work similarly to previous commit in creating discard and keep subfolders.

File size: 39.9 KB

Line
1	Hadoop/Map-reduce
2
3	https://www.guru99.com/create-your-first-hadoop-program.html
4
5	--------------
6	To run firefox/anything graphical inside the VM run by vagrant, have to ssh -Y onto both analytics and then to the vagrant VM from analytics:
7	1. ssh analytics -Y
8	2. [anupama@analytics vagrant-hadoop-hive-spark]$ vagrant ssh -- -Y
9	or
10	vagrant ssh -- -Y node1
11	(the -- flag tells the vagrant command that the subsequent -Y flag should be passed to the ssh cmd that vagrant runs)
12
13	Only once ssh-ed with vagrant into the VM whose hostname is "node1", do you have access to node1's assigned IP: 10.211.55.101
14	- Connecting machines, like analytics, must access node1 or use port forwarding to view the VM's servers on localhost. For example, on analytics, can view Yarn pages at http://localhost:8088/
15	- If firefox is launched inside the VM (so inside node1), then can access pages off their respective ports at any of localhost\|10.211.55.101\|node1.
16
17	===========================================
18	WARC TO WET
19	===========================================
20	https://groups.google.com/forum/#!topic/common-crawl/hsb90GHq6to
21
22	Sebastian Nagel
23	05/07/2017
24	Hi,
25
26	unfortunately, we do not provide (yet) text extracts of the CC-NEWS archives.
27
28	But it's easy to run the WET extractor on the WARC files, see:
29	https://groups.google.com/d/topic/common-crawl/imv4hlLob4s/discussion
30	https://groups.google.com/d/topic/common-crawl/b6yDG7EmnhM/discussion
31
32	That's what you have to do:
33
34	# download the WARC files and place them in a directory "warc/"
35	# create sibling folders wat and wet
36	# \|
37	# \|-- warc/
38	# \| \|-- CC-NEWS-20161001224340-00008.warc.gz
39	# \| \|-- CC-NEWS-20161017145313-00000.warc.gz
40	# \| `-- ...
41	# \|
42	# \|-- wat/
43	# \|
44	# `-- wet/
45
46	git clone https://github.com/commoncrawl/ia-web-commons
47	cd ia-web-commons
48	mvn install
49
50	cd ..
51	git clone https://github.com/commoncrawl/ia-hadoop-tools
52	cd ia-hadoop-tools
53	mvn package
54
55	java -jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator \
56	-strictMode -skipExisting batch-id-xyz .../warc/*.warc.gz
57
58	The folders wat/ and wet/ will then contain the exports.
59
60	Best,
61	Sebastian
62
63	---
64
65	1. So following the above instructions, I first made a warc subfolder in hdfs:///user/vagrant/cc-mri-subset/
66	Then moved all the downloaded *warc.gz into there.
67	Then created wat and wet subfolders in there alongside the warc folder.
68
69	2. Next, I did the 2 git clone and mvn compile operations above.
70	The first, ia-web-commons, successfully compiled (despite some test failures)
71
72	3. BUT THE 2ND GIT PROJECT, ia-hadoop-tools, DIDN'T COMPILE AT FIRST, with the mvn package command failing:
73
74	git clone https://github.com/commoncrawl/ia-hadoop-tools
75	cd ia-hadoop-tools
76	mvn package
77
78	Compile failed with a message about the JSONTokener constructor not taking a String object. It turned out that the JSONTokener used was a version of the class that was too old. Whereas the necessary constructor is present in the most recent version, as seen in the API at https://docs.oracle.com/cd/E51273_03/ToolsAndFrameworks.110/apidoc/assembler/com/endeca/serialization/json/JSONTokener.html
79
80	So instead, I opened up ia-hadoop-tools/pom.xml for editing and added the newest version of the org.json package's json (see http://builds.archive.org/maven2/org/json/json/ for <version>) into the pom.xml's <dependencies> element, based on how it this was done at https://bukkit.org/threads/problem-loading-libraries-with-maven-java-lang-noclassdeffounderror-org-json-jsonobject.428354/:
81
82	<dependency>
83	<groupId>org.json</groupId>
84	<artifactId>json</artifactId>
85	<version>20131018</version>
86	</dependency>
87
88	Then I was able to run "mvn package" successfully.
89	(Maybe I could also have added in a far more recent version, as seen in the version numbers at https://mvnrepository.com/artifact/org.json/json,
90	but didn't want to go too far ahead in case there was other incompatibility.)
91
92	4. Next, I wanted to finally run the built executable to convert the warc files to wet files.
93
94	I had the warc files on the hadoop filesystem. The original instructions at https://groups.google.com/forum/#!topic/common-crawl/hsb90GHq6to however were apparently for working with warcs stored on the local filesystem, as those instructions did not run the hadoop command but the regular java command. The regular java command did not work with the files being on the hadoop system (attempt #1 below).
95
96	ATTEMPTS THAT DIDN'T WORK:
97	1. vagrant@node1:~/ia-hadoop-tools$ java -jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -strictMode -skipExisting batch-id-xyz hdfs:///user/vagrant/cc-mri-subset/warc/*.warc.gz
98	2. vagrant@node1:~/ia-hadoop-tools$ $HADOOP_MAPRED_HOME/bin/hadoop jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -strictMode -skipExisting batch-id-xyz hdfs:///user/vagrant/cc-mri-subset/warc/*.warc.gz
99
100
101	The 2nd attempt, which uses a proper hadoop command, I based off https://www.tutorialspoint.com/map_reduce/implementation_in_hadoop.htm
102	It produced lots of errors and the output wet (and wat) .gz files were all corrupt as gunzip could not successfully run over them:
103
104	vagrant@node1:~/ia-hadoop-tools$ $HADOOP_MAPRED_HOME/bin/hadoop jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -strictMode -skipExisting batch-id-xyz hdfs:///user/vagrant/cc-mri-subset/warc/*.warc.gz
105	19/09/05 05:57:22 INFO Configuration.deprecation: mapred.task.timeout is deprecated. Instead, use mapreduce.task.timeout
106	19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902100139-000000.warc.gz
107	19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902100141-000001.warc.gz
108	19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902100451-000002.warc.gz
109	19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902100453-000003.warc.gz
110	19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902100805-000004.warc.gz
111	19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902100809-000005.warc.gz
112	19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902101119-000006.warc.gz
113	19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902101119-000007.warc.gz
114	19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902101429-000008.warc.gz
115	19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902101429-000009.warc.gz
116	19/09/05 05:57:23 INFO client.RMProxy: Connecting to ResourceManager at node1/127.0.0.1:8032
117	19/09/05 05:57:23 INFO client.RMProxy: Connecting to ResourceManager at node1/127.0.0.1:8032
118	19/09/05 05:57:23 INFO mapred.FileInputFormat: Total input paths to process : 10
119	19/09/05 05:57:24 INFO mapreduce.JobSubmitter: number of splits:10
120	19/09/05 05:57:24 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1567397114047_0001
121	19/09/05 05:57:24 INFO impl.YarnClientImpl: Submitted application application_1567397114047_0001
122	19/09/05 05:57:24 INFO mapreduce.Job: The url to track the job: http://node1:8088/proxy/application_1567397114047_0001/
123	19/09/05 05:57:24 INFO mapreduce.Job: Running job: job_1567397114047_0001
124	19/09/05 05:57:31 INFO mapreduce.Job: Job job_1567397114047_0001 running in uber mode : false
125	19/09/05 05:57:31 INFO mapreduce.Job: map 0% reduce 0%
126	19/09/05 05:57:44 INFO mapreduce.Job: map 10% reduce 0%
127	19/09/05 05:57:44 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000002_0, Status : FAILED
128	Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
129	Container killed by the ApplicationMaster.
130	Container killed on request. Exit code is 143
131	Container exited with a non-zero exit code 143
132
133	19/09/05 05:57:45 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000004_0, Status : FAILED
134	Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
135	Container killed by the ApplicationMaster.
136	Container killed on request. Exit code is 143
137	Container exited with a non-zero exit code 143
138
139	19/09/05 05:57:45 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000001_0, Status : FAILED
140	Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
141	19/09/05 05:57:45 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000005_0, Status : FAILED
142	Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
143	19/09/05 05:57:45 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000000_0, Status : FAILED
144	Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
145	19/09/05 05:57:45 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000003_0, Status : FAILED
146	Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
147	19/09/05 05:57:46 INFO mapreduce.Job: map 0% reduce 0%
148	19/09/05 05:57:54 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000007_0, Status : FAILED
149	Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
150	19/09/05 05:57:55 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000006_0, Status : FAILED
151	Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
152	19/09/05 05:57:57 INFO mapreduce.Job: map 10% reduce 0%
153	19/09/05 05:57:57 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000009_0, Status : FAILED
154	Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
155	Container killed by the ApplicationMaster.
156	Container killed on request. Exit code is 143
157	Container exited with a non-zero exit code 143
158
159	19/09/05 05:57:58 INFO mapreduce.Job: map 20% reduce 0%
160	19/09/05 05:57:58 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000008_0, Status : FAILED
161	Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
162	19/09/05 05:58:06 INFO mapreduce.Job: map 30% reduce 0%
163	19/09/05 05:58:08 INFO mapreduce.Job: map 60% reduce 0%
164	19/09/05 05:58:09 INFO mapreduce.Job: map 70% reduce 0%
165	19/09/05 05:58:10 INFO mapreduce.Job: map 80% reduce 0%
166	19/09/05 05:58:12 INFO mapreduce.Job: map 90% reduce 0%
167	19/09/05 05:58:13 INFO mapreduce.Job: map 100% reduce 0%
168	19/09/05 05:58:13 INFO mapreduce.Job: Job job_1567397114047_0001 completed successfully
169	19/09/05 05:58:13 INFO mapreduce.Job: Counters: 32
170	File System Counters
171	FILE: Number of bytes read=0
172	FILE: Number of bytes written=1239360
173	FILE: Number of read operations=0
174	FILE: Number of large read operations=0
175	FILE: Number of write operations=0
176	HDFS: Number of bytes read=1430
177	HDFS: Number of bytes written=0
178	HDFS: Number of read operations=30
179	HDFS: Number of large read operations=0
180	HDFS: Number of write operations=0
181	Job Counters
182	Failed map tasks=10
183	Launched map tasks=20
184	Other local map tasks=10
185	Data-local map tasks=10
186	Total time spent by all maps in occupied slots (ms)=208160
187	Total time spent by all reduces in occupied slots (ms)=0
188	Total time spent by all map tasks (ms)=208160
189	Total vcore-milliseconds taken by all map tasks=208160
190	Total megabyte-milliseconds taken by all map tasks=213155840
191	Map-Reduce Framework
192	Map input records=10
193	Map output records=0
194	Input split bytes=1430
195	Spilled Records=0
196	Failed Shuffles=0
197	Merged Map outputs=0
198	GC time elapsed (ms)=1461
199	CPU time spent (ms)=2490
200	Physical memory (bytes) snapshot=1564528640
201	Virtual memory (bytes) snapshot=19642507264
202	Total committed heap usage (bytes)=1126170624
203	File Input Format Counters
204	Bytes Read=0
205	File Output Format Counters
206	Bytes Written=0
207	vagrant@node1:~/ia-hadoop-tools$
208
209
210	5. The error messages are all the same but not very informative
211	19/09/05 05:57:45 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000001_0, Status : FAILED
212	Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
213
214	All the references I could find on google indicated that the full version of the error message was that this method (com.google.common.io.ByteStreams.limit(...)) could not be located.
215	The page at http://mail-archives.apache.org/mod_mbox/spark-user/201510.mbox/%[email protected]%3E
216	revealed that guava.jar contains the com.google.common.io.ByteStreams class.
217
218
219	TO GET THE EXECUTABLE TO WORK:
220	I located guava.jar, found there were 2 identical ones on the filesystem but that neither was on the hadoop classpath yet, so I copied it into one of the Hadoop Classpath locations. Then I was able to successfully run the executable and produce meaningful WET files at last from the WARC input files:
221
222
223	vagrant@node1:~$ locate guava.jar
224	/usr/share/java/guava.jar
225	/usr/share/maven/lib/guava.jar
226	vagrant@node1:~$ jar -tvf /usr/share/maven/lib/guava.jar \| less
227	vagrant@node1:~$ jar -tvf /usr/share/java/guava.jar \| less
228	# both contained the ByteStreams class
229
230	vagrant@node1:~$ cd -
231	/home/vagrant/ia-hadoop-tools
232	vagrant@node1:~/ia-hadoop-tools$ find . -name "guava.jar"
233	# None in the git project
234
235	vagrant@node1:~/ia-hadoop-tools$ hadoop classpath
236	/usr/local/hadoop/etc/hadoop:/usr/local/hadoop/share/hadoop/common/lib/:/usr/local/hadoop/share/hadoop/common/:/usr/local/hadoop/share/hadoop/hdfs:/usr/local/hadoop/share/hadoop/hdfs/lib/:/usr/local/hadoop/share/hadoop/hdfs/:/usr/local/hadoop/share/hadoop/yarn/lib/:/usr/local/hadoop/share/hadoop/yarn/:/usr/local/hadoop/share/hadoop/mapreduce/lib/:/usr/local/hadoop/share/hadoop/mapreduce/:/contrib/capacity-scheduler/*.jar
237	# guava.jar not on hadoop classpath yet
238
239	vagrant@node1:~/ia-hadoop-tools$ diff /usr/share/java/guava.jar /usr/share/maven/lib/guava.jar
240	# no differences, identical
241
242	vagrant@node1:~/ia-hadoop-tools$ hdfs dfs -put /usr/share/java/guava.jar /usr/local/hadoop/share/hadoop/common/.
243	put: `/usr/local/hadoop/share/hadoop/common/.': No such file or directory
244	# hadoop classpath locations are not on the hdfs filesystem, but on the regular fs
245
246	vagrant@node1:~/ia-hadoop-tools$ sudo cp /usr/share/java/guava.jar /usr/local/hadoop/share/hadoop/common/.
247	vagrant@node1:~/ia-hadoop-tools$ hadoop classpath
248	/usr/local/hadoop/etc/hadoop:/usr/local/hadoop/share/hadoop/common/lib/:/usr/local/hadoop/share/hadoop/common/:/usr/local/hadoop/share/hadoop/hdfs:/usr/local/hadoop/share/hadoop/hdfs/lib/:/usr/local/hadoop/share/hadoop/hdfs/:/usr/local/hadoop/share/hadoop/yarn/lib/:/usr/local/hadoop/share/hadoop/yarn/:/usr/local/hadoop/share/hadoop/mapreduce/lib/:/usr/local/hadoop/share/hadoop/mapreduce/:/contrib/capacity-scheduler/*.jar
249	# Copied guava.jar to somewhere on existing hadoop classpath
250
251	vagrant@node1:~/ia-hadoop-tools$ $HADOOP_MAPRED_HOME/bin/hadoop jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -strictMode -skipExisting batch-id-xyz hdfs:///user/vagrant/cc-mri-subset/warc/*.warc.gz
252	# Successful run
253
254	vagrant@node1:~$ hdfs dfs -get /user/vagrant/cc-mri-subset/wet/MAORI-CC-2019-30-20190902100139-000000.warc.wet.gz /home/vagrant/.
255	vagrant@node1:~$ cd ..
256	vagrant@node1:~$ gunzip MAORI-CC-2019-30-20190902100139-000000.warc.wet.gz
257	vagrant@node1:~$ less MAORI-CC-2019-30-20190902100139-000000.warc.wet
258	# Copied a WET output file from the hadoop filesystem to local filesystem and inspected its contents. Works!
259
260	-----------------------------------
261	VIEW THE MRI-ONLY INDEX GENERATED
262	-----------------------------------
263	hdfs dfs -cat hdfs:///user/vagrant/cc-mri-csv/part* \| tail -5
264
265	(gz archive, binary file)
266
267	vagrant@node1:~/cc-index-table/src/script$ hdfs dfs -mkdir hdfs:///user/vagrant/cc-mri-unzipped-csv
268
269	# https://stackoverflow.com/questions/34573279/how-to-unzip-gz-files-in-a-new-directory-in-hadoop
270	XXX vagrant@node1:~/cc-index-table/src/script$ hadoop fs -cat hdfs:///user/vagrant/cc-mri-csv/part* \| gzip -d \| hadoop fs -put - hdfs:///user/vagrant/cc-mri-unzipped-csv
271
272
273	vagrant@node1:~/cc-index-table/src/script$ hdfs dfs -cat hdfs:///user/vagrant/cc-mri-csv/part* \| gzip -d \| hdfs dfs -put - hdfs:///user/vagrant/cc-mri-unzipped-csv/cc-mri.csv
274	vagrant@node1:~/cc-index-table/src/script$ hdfs dfs -ls hdfs:///user/vagrant/cc-mri-unzipped-csv
275	Found 1 items
276	-rw-r--r-- 1 vagrant supergroup 71664603 2019-08-29 04:47 hdfs:///user/vagrant/cc-mri-unzipped-csv/cc-mri.csv
277
278	# https://stackoverflow.com/questions/14925323/view-contents-of-file-in-hdfs-hadoop
279	vagrant@node1:~/cc-index-table/src/script$ hdfs dfs -cat hdfs:///user/vagrant/cc-mri-unzipped-csv/cc-mri.csv \| tail -5
280
281	# url, warc_filename, warc_record_offset, warc_record_length
282	http://paupauocean.com/page91?product_id=142&brd=1,crawl-data/CC-MAIN-2019-30/segments/1563195526940.0/warc/CC-MAIN-20190721082354-20190721104354-00088.warc.gz,115081770,21404
283	https://cookinseln-reisen.de/cook-inseln/rarotonga/,crawl-data/CC-MAIN-2019-30/segments/1563195526799.4/warc/CC-MAIN-20190720235054-20190721021054-00289.warc.gz,343512295,12444
284	http://www.halopharm.com/mi/profile/,crawl-data/CC-MAIN-2019-30/segments/1563195525500.21/warc/CC-MAIN-20190718042531-20190718064531-00093.warc.gz,219160333,10311
285	https://www.firstpeople.us/pictures/green/Touched-by-the-hand-of-Time-1907.html,crawl-data/CC-MAIN-2019-30/segments/1563195526670.1/warc/CC-MAIN-20190720194009-20190720220009-00362.warc.gz,696195242,5408
286	https://www.sos-accessoire.com/programmateur-programmateur-module-electronique-whirlpool-481231028062-27573.html,crawl-data/CC-MAIN-2019-30/segments/1563195527048.80/warc/CC-MAIN-20190721144008-20190721170008-00164.warc.gz,830087190,26321
287
288	# https://stackoverflow.com/questions/32612867/how-to-count-lines-in-a-file-on-hdfs-command
289	vagrant@node1:~/cc-index-table/src/script$ hdfs dfs -cat hdfs:///user/vagrant/cc-mri-unzipped-csv/cc-mri.csv \| wc -l
290	345625
291
292
293
294	vagrant@node1:~/cc-index-table$ hdfs dfs -cat hdfs:///user/vagrant/cc-mri-csv/part* > file.csv.gz
295	vagrant@node1:~/cc-index-table$ less file.csv.gz
296
297
298	https://www.patricia-anong.com/blog/2017/11/1/extend-vmdk-on-virtualbox
299
300
301	When not using LIKE '%mri%' but = 'mri' instead:
302	vagrant@node1:~/cc-index-table$ hdfs dfs -cat hdfs:///user/vagrant/cc-mri-unzipped-csv/cc-mri.csv \| wc -l
303	5767
304
305	-----------------------------------------
306	Running export_mri_subset.sh
307	-----------------------------------------
308
309	The export_mri_subset.sh script is set up run on the csv input file produced by running export_mri_index_csv.sh
310
311	Running this initially produced the following exception:
312
313
314	2019-08-29 05:48:52 INFO CCIndexExport:152 - Number of records/rows matched by query: 345624
315	2019-08-29 05:48:52 INFO CCIndexExport:157 - Distributing 345624 records to 70 output partitions (max. 5000 records per WARC file)
316	2019-08-29 05:48:52 INFO CCIndexExport:165 - Repartitioning data to 70 output partitions
317	Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`url`' given input columns: [http://176.31.110.213:600/?p=287, crawl-data/CC-MAIN-2019-30/segments/1563195527531.84/warc/CC-MAIN-20190722051628-20190722073628-00547.warc.gz, 1215489, 15675];;
318	'Project ['url, 'warc_filename, 'warc_record_offset, 'warc_record_length]
319	+- AnalysisBarrier
320	+- Repartition 70, true
321	+- Relation[http://176.31.110.213:600/?p=287#10,crawl-data/CC-MAIN-2019-30/segments/1563195527531.84/warc/CC-MAIN-20190722051628-20190722073628-00547.warc.gz#11,1215489#12,15675#13] csv
322
323	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
324	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:88)
325	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85)
326	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
327	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
328	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
329	at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
330	at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95)
331	at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95)
332	at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:106)
333	at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:116)
334	at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:120)
335	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
336	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
337	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
338	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
339	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
340	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
341	at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:120)
342	at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:125)
343	at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
344	at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:125)
345	at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:95)
346	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:85)
347	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80)
348	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
349	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80)
350	at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
351	at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:104)
352	at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
353	at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
354	at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
355	at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74)
356	at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:3295)
357	at org.apache.spark.sql.Dataset.select(Dataset.scala:1307)
358	at org.apache.spark.sql.Dataset.select(Dataset.scala:1325)
359	at org.apache.spark.sql.Dataset.select(Dataset.scala:1325)
360	at org.commoncrawl.spark.examples.CCIndexWarcExport.run(CCIndexWarcExport.java:169)
361	at org.commoncrawl.spark.examples.CCIndexExport.run(CCIndexExport.java:192)
362	at org.commoncrawl.spark.examples.CCIndexWarcExport.main(CCIndexWarcExport.java:214)
363	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
364	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
365	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
366	at java.lang.reflect.Method.invoke(Method.java:498)
367	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
368	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
369	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
370	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
371	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
372	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
373	2019-08-29 05:48:52 INFO SparkContext:54 - Invoking stop() from shutdown hook
374
375
376
377	Hints to solve it were at https://stackoverflow.com/questions/45972929/scala-dataframereader-keep-column-headers
378	The actual solution is to edit the CCIndexWarcExport.java as follows:
379	1. set option(header) to false since the csv file contains no header row, only data rows. You can confirm the csv has no header row by doing
380	hdfs dfs -cat hdfs:///user/vagrant/cc-mri-csv/part* \| head -5
381
382	2. The 4 column names are inferred as _c0 to _c3, not as url/warc_filename etc.
383
384	emacs src/main/java/org/commoncrawl/spark/examples/CCIndexWarcExport.java
385
386	Change:
387	sqlDF = sparkSession.read().format("csv").option("header", true).option("inferSchema", true)
388	.load(csvQueryResult);
389	To
390	sqlDF = sparkSession.read().format("csv").option("header", false).option("inferSchema", true)
391	.load(csvQueryResult);
392
393	And comment out:
394	//JavaRDD<Row> rdd = sqlDF.select("url", "warc_filename", "warc_record_offset", "warc_record_length").rdd()
395	.toJavaRDD();
396	Replace with the default inferred column names:
397	JavaRDD<Row> rdd = sqlDF.select("_c0", "_c1", "_c2", "_c3").rdd()
398	.toJavaRDD();
399
400
401	Now recompile:
402	mvn package
403
404	And run:
405	./src/script/export_mri_subset.sh
406
407	-------------------------
408
409	WET example from https://github.com/commoncrawl/cc-warc-examples
410
411	vagrant@node1:~/cc-warc-examples$ hdfs dfs -mkdir /user/vagrant/data
412	vagrant@node1:~/cc-warc-examples$ hdfs dfs -put data/CC-MAIN-20190715175205-20190715200159-00000.warc.wet.gz hdfs:///user/vagrant/data/.
413	vagrant@node1:~/cc-warc-examples$ hdfs dfs -ls data
414	Found 1 items
415	-rw-r--r-- 1 vagrant supergroup 154823265 2019-08-19 08:23 data/CC-MAIN-20190715175205-20190715200159-00000.warc.wet.gz
416	vagrant@node1:~/cc-warc-examples$ hadoop jar target/cc-warc-examples-0.3-SNAPSHOT-jar-with-dependencies.jar org.commoncrawl.examples.mapreduce.WETWordCount
417
418	<ONCE FINISHED:>
419
420	vagrant@node1:~/cc-warc-examples$ hdfs dfs -cat /tmp/cc/part*
421
422
423
424	INFO ON HADOOP/HDFS:
425	https://www.bluedata.com/blog/2016/08/hadoop-hdfs-upgrades-painful/
426
427	SPARK:
428	configure option example: https://stackoverflow.com/questions/46152526/how-should-i-configure-spark-to-correctly-prune-hive-metastore-partitions
429
430
431
432	LIKE '%isl%'
433
434	cd cc-index-table
435	APPJAR=target/cc-spark-0.2-SNAPSHOT-jar-with-dependencies.jar
436	> $SPARK_HOME/bin/spark-submit \
437	# $SPARK_ON_YARN \
438	--conf spark.hadoop.parquet.enable.dictionary=true \
439	--conf spark.hadoop.parquet.enable.summary-metadata=false \
440	--conf spark.sql.hive.metastorePartitionPruning=true \
441	--conf spark.sql.parquet.filterPushdown=true \
442	--conf spark.sql.parquet.mergeSchema=true \
443	--class org.commoncrawl.spark.examples.CCIndexWarcExport $APPJAR \
444	--query "SELECT url, warc_filename, warc_record_offset, warc_record_length
445	FROM ccindex
446	WHERE crawl = 'CC-MAIN-2019-30' AND subset = 'warc' AND content_languages LIKE '%mri%'" \
447	--numOutputPartitions 12 \
448	--numRecordsPerWarcFile 20000 \
449	--warcPrefix ICELANDIC-CC-2018-43 \
450	s3://commoncrawl/cc-index/table/cc-main/warc/ \
451	.../my_output_path/
452
453
454	----
455	TIME
456	----
457	1. https://dzone.com/articles/need-billions-of-web-pages-dont-bother-crawling
458	http://digitalpebble.blogspot.com/2017/03/need-billions-of-web-pages-dont-bother_29.html
459
460	"So, not only have CommonCrawl given you loads of web data for free, theyâve also made your life easier by preprocessing the data for you. For many tasks, the content of the WAT or WET files will be sufficient and you wonât have to process the WARC files.
461
462	This should not only help you simplify your code but also make the whole processing faster. We recently ran an experiment on CommonCrawl where we needed to extract anchor text from HTML pages. We initially wrote some MapReduce code to extract the binary content of the pages from their WARC representation, processed the HTML with JSoup and reduced on the anchor text. Processing a single WARC segment took roughly 100 minutes on a 10-node EMR cluster. We then simplified the extraction logic, took the WAT files as input and the processing time dropped to 17 minutes on the same cluster. This gain was partly due to not having to parse the web pages, but also to the fact that WAT files are a lot smaller than their WARC counterparts."
463
464	2. https://spark-in.me/post/parsing-common-crawl-in-two-simple-commands
465	"Both the parsing part and the processing part take just a couple of minutes per index file / WET file - the bulk of the âcomputeâ lies within actually downloading these files.
466
467	Essentially if you have some time to spare and an unlimited Internet connection, all of this processing can be done on one powerful machine. You can be fancy and go ahead and rent some Amazon server(s) to minimize the download time, but that can be costly.
468
469	In my experience - parsing the whole index for Russian websites (just filtering by language) takes approximately 140 hours - but the majority of this time is just downloading (my speed averaged ~300-500 kb/s)."
470
471	----
472	CMDS
473	----
474	https://stackoverflow.com/questions/29565716/spark-kill-running-application
475
476	=========================================================
477	Configuring spark to work on Amazon AWS s3a dataset:
478	=========================================================
479	https://stackoverflow.com/questions/30385981/how-to-access-s3a-files-from-apache-spark
480	http://deploymentzone.com/2015/12/20/s3a-on-spark-on-aws-ec2/
481	https://answers.dataiku.com/1734/common-crawl-s3
482	https://stackoverflow.com/questions/2354525/what-should-be-hadoop-tmp-dir
483	https://stackoverflow.com/questions/40169610/where-exactly-should-hadoop-tmp-dir-be-set-core-site-xml-or-hdfs-site-xml
484
485	https://stackoverflow.com/questions/43759896/spark-truncated-the-string-representation-of-a-plan-since-it-was-too-large-w
486
487
488	https://sparkour.urizone.net/recipes/using-s3/
489	Configuring Spark to Use Amazon S3
490	"Some Spark tutorials show AWS access keys hardcoded into the file paths. This is a horribly insecure approach and should never be done. Use exported environment variables or IAM Roles instead, as described in Configuring Amazon S3 as a Spark Data Source."
491
492	"No FileSystem for scheme: s3n
493
494	java.io.IOException: No FileSystem for scheme: s3n
495
496	This message appears when dependencies are missing from your Apache Spark distribution. If you see this error message, you can use the --packages parameter and Spark will use Maven to locate the missing dependencies and distribute them to the cluster. Alternately, you can use --jars if you manually downloaded the dependencies already. These parameters also works on the spark-submit script."
497
498	===========================================
499	IAM Role (or user) and commoncrawl profile
500	===========================================
501
502	"iam" role or user for commoncrawl(er) profile
503
504
505	aws management console:
506	[email protected]
507	lab pwd, capital R and ! (maybe g)
508
509	commoncrawl profile created while creating the user/role, by following the instructions at: https://answers.dataiku.com/1734/common-crawl-s3
510
511	<!--
512	<property>
513	<name>fs.s3a.awsAccessKeyId</name>
514	<value>XXX</value>
515	</property>
516	<property>
517	<name>fs.s3a.awsSecretAccessKey</name>
518	<value>XXX</value>
519	</property>
520	-->
521
522
523	[If accesskey and secret were specified in hadoop core-site.xml and not in spark conf props file, then running export_maori_index_csv.sh produced the following error:
524
525	2019-08-29 06:16:38 INFO StateStoreCoordinatorRef:54 - Registered StateStoreCoordinator endpoint
526	2019-08-29 06:16:40 WARN FileStreamSink:66 - Error while looking for metadata directory.
527	Exception in thread "main" com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain
528	at com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
529	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3521)
530	at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
531	at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
532	at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
533	]
534
535	Instead of putting the access and secret keys in hadoop's core-site.xml as above (with sudo emacs /usr/local/hadoop-2.7.6/etc/hadoop/core-site.xml)
536
537	you'll want to put the Amazon AWS access key and secret key in the spark properties file:
538
539	sudo emacs /usr/local/spark-2.3.0-bin-hadoop2.7/conf/spark-defaults.conf
540
541
542	The spark properties conf file above should contain:
543
544	spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
545	spark.hadoop.fs.s3a.access.key=PASTE_IAM-ROLE_ACCESSKEY
546	spark.hadoop.fs.s3a.secret.key=PASTE_IAM-ROLE_SECRETKEY
547
548
549
550	When the job is running, can visit the Spark Context at http://node1:4040/jobs/ (http://node1:4041/jobs/ for me first time, since I forwarded the vagrant VM's ports at +1. However, subsequent times it was on node1:4040/jobs?)
551
552	-------------
553
554	APPJAR=target/cc-spark-0.2-SNAPSHOT-jar-with-dependencies.jar
555	$SPARK_HOME/bin/spark-submit \
556	--conf spark.hadoop.parquet.enable.dictionary=true \
557	--conf spark.hadoop.parquet.enable.summary-metadata=false \
558	--conf spark.sql.hive.metastorePartitionPruning=true \
559	--conf spark.sql.parquet.filterPushdown=true \
560	--conf spark.sql.parquet.mergeSchema=true \
561	--class org.commoncrawl.spark.examples.CCIndexExport $APPJAR \
562	--query "SELECT url, warc_filename, warc_record_offset, warc_record_length
563	FROM ccindex
564	WHERE crawl = 'CC-MAIN-2019-30' AND subset = 'warc' AND content_languages LIKE '%mri%'" \
565	--outputFormat csv \
566	--numOutputPartitions 10 \
567	--outputCompression gzip \
568	s3://commoncrawl/cc-index/table/cc-main/warc/ \
569	hdfs:///user/vagrant/cc-mri-csv
570
571	----------------
572	Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
573
574
575	https://stackoverflow.com/questions/39355354/spark-no-filesystem-for-scheme-https-cannot-load-files-from-amazon-s3
576	https://stackoverflow.com/questions/33356041/technically-what-is-the-difference-between-s3n-s3a-and-s3
577	"2018-01-10 Update Hadoop 3.0 has cut its s3: and s3n implementations: s3a is all you get. It is now significantly better than its predecessor and performs as least as good as the Amazon implementation. Amazon's "s3:" is still offered by EMR, which is their closed source client. Consult the EMR docs for more info."
578
579	1. https://stackoverflow.com/questions/30385981/how-to-access-s3a-files-from-apache-spark
580
581	"Having experienced first hand the difference between s3a and s3n - 7.9GB of data transferred on s3a was around ~7 minutes while 7.9GB of data on s3n took 73 minutes [us-east-1 to us-west-1 unfortunately in both cases; Redshift and Lambda being us-east-1 at this time] this is a very important piece of the stack to get correct and it's worth the frustration.
582
583	Here are the key parts, as of December 2015:
584
585	Your Spark cluster will need a Hadoop version 2.x or greater. If you use the Spark EC2 setup scripts and maybe missed it, the switch for using something other than 1.0 is to specify --hadoop-major-version 2 (which uses CDH 4.2 as of this writing).
586
587	You'll need to include what may at first seem to be an out of date AWS SDK library (built in 2014 as version 1.7.4) for versions of Hadoop as late as 2.7.1 (stable): aws-java-sdk 1.7.4. As far as I can tell using this along with the specific AWS SDK JARs for 1.10.8 hasn't broken anything.
588
589	You'll also need the hadoop-aws 2.7.1 JAR on the classpath. This JAR contains the class org.apache.hadoop.fs.s3a.S3AFileSystem.
590
591	In spark.properties you probably want some settings that look like this:
592
593	spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
594	spark.hadoop.fs.s3a.access.key=ACCESSKEY
595	spark.hadoop.fs.s3a.secret.key=SECRETKEY
596
597	I've detailed this list in more detail on a post I wrote as I worked my way through this process. In addition I've covered all the exception cases I hit along the way and what I believe to be the cause of each and how to fix them."
598
599
600	2. The classpath used by hadoop can be found by running the command (https://stackoverflow.com/questions/26748811/setting-external-jars-to-hadoop-classpath):
601	hadoop classpath
602
603
604	3. Got hadoop-aws 2.7.6 jar
605	from https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/2.7.6
606	and put it into /home/vagrant
607
608
609	4. https://stackoverflow.com/questions/26748811/setting-external-jars-to-hadoop-classpath
610	https://stackoverflow.com/questions/28520821/how-to-add-external-jar-to-hadoop-job/54459211#54459211
611	vagrant@node1:~$ export LIBJARS=/home/vagrant/hadoop-aws-2.7.6.jar
612	vagrant@node1:~$ export HADOOP_CLASSPATH=`echo ${LIBJARS} \| sed s/,/:/g`
613	vagrant@node1:~$ hadoop classpath
614
615	5. https://community.cloudera.com/t5/Community-Articles/HDP-2-4-0-and-Spark-1-6-0-connecting-to-AWS-S3-buckets/ta-p/245760
616	"Download the aws sdk for java https://aws.amazon.com/sdk-for-java/ Uploaded it to the hadoop directory. You should see the aws-java-sdk-1.10.65.jar in /usr/hdp/2.4.0.0-169/hadoop/"
617
618	I got version 1.11
619
620	[Can't find a spark.properties file, but this seems to contain spark specific properties:
621	$SPARK_HOME/conf/spark-defaults.conf
622
623	https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-properties.html
624	"The default Spark properties file is $SPARK_HOME/conf/spark-defaults.conf that could be overriden using spark-submit with the --properties-file command-line option."]
625
626	Can SUDO COPY the 2 jar files hadoop-aws-2.7.6.jar and aws-java-sdk-1.11.616.jar to:
627	/usr/local/hadoop/share/hadoop/common/
628	(else /usr/local/hadoop/share/hadoop/hdfs/hadoop-aws-2.7.6.jar)
629
630	--------
631	schema
632	https://commoncrawl.s3.amazonaws.com/cc-index/table/cc-main/index.html
633
634	---------------
635	More examples to try:
636	https://github.com/commoncrawl/cc-warc-examples
637
638
639	A bit outdated?
640	https://www.journaldev.com/20342/apache-spark-example-word-count-program-java
641	https://www.journaldev.com/20261/apache-spark
642
643	--------
644
645	sudo apt-get install maven
646	(or sudo apt update
647	sudo apt install maven)
648	git clone https://github.com/commoncrawl/cc-index-table.git
649	cd cc-index-table
650	mvn package
651	vagrant@node1:~/cc-index-table$ ./src/script/convert_url_index.sh https://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2019-30/indexes/cdx-00000.gz hdfs:///user/vagrant/cc-index-table
652
653
654
655
656	spark:
657	https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-shell.html
658
659	============
660	Dr Bainbridge found the following vagrant file that will set up hadoop and spark presumably for cluster computing:
661
662	https://github.com/martinprobson/vagrant-hadoop-hive-spark
663
664	Vagrant:
665	* Guide: https://www.vagrantup.com/intro/getting-started/index.html
666	* Common cmds: https://blog.ipswitch.com/5-vagrant-commands-you-need-to-know
667	* vagrant reload = vagrant halt + vagrant up https://www.vagrantup.com/docs/cli/reload.html
668	* https://stackoverflow.com/questions/46903623/how-to-use-firefox-ui-in-vagrant-box
669	* https://stackoverflow.com/questions/22651399/how-to-install-firefox-in-precise64-vagrant-box
670	sudo apt-get -y install firefox
671	* vagrant install emacs: https://medium.com/@AnnaJS15/getting-started-with-virtualbox-and-vagrant-8d98aa271d2a
672
673	* hadoop conf: sudo vi /usr/local/hadoop-2.7.6/etc/hadoop/mapred-site.xml
674	* https://data-flair.training/forums/topic/mkdir-cannot-create-directory-data-name-node-is-in-safe-mode/
675	---
676	==> node1: Forwarding ports...
677	node1: 8080 (guest) => 8081 (host) (adapter 1)
678	node1: 8088 (guest) => 8089 (host) (adapter 1)
679	node1: 9083 (guest) => 9084 (host) (adapter 1)
680	node1: 4040 (guest) => 4041 (host) (adapter 1)
681	node1: 18888 (guest) => 18889 (host) (adapter 1)
682	node1: 16010 (guest) => 16011 (host) (adapter 1)
683	node1: 22 (guest) => 2200 (host) (adapter 1)
684	==> node1: Running 'pre-boot' VM customizations...
685
686
687	==> node1: Checking for guest additions in VM...
688	node1: The guest additions on this VM do not match the installed version of
689	node1: VirtualBox! In most cases this is fine, but in rare cases it can
690	node1: prevent things such as shared folders from working properly. If you see
691	node1: shared folder errors, please make sure the guest additions within the
692	node1: virtual machine match the version of VirtualBox you have installed on
693	node1: your host and reload your VM.
694	node1:
695	node1: Guest Additions Version: 5.1.38
696	node1: VirtualBox Version: 5.2
697
698	------------

Note: See TracBrowser for help on using the repository browser.

Context Navigation

source: gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt@ 33467

Download in other formats: