1Hadoop/Map-reduce
2
3https://www.guru99.com/create-your-first-hadoop-program.html
4
5--------------
To run firefox (or anything else graphical) inside the VM run by vagrant, you have to ssh -Y twice: first onto analytics, and then from analytics onto the vagrant VM:
1. ssh analytics -Y
2. [anupama@analytics vagrant-hadoop-hive-spark]$ vagrant ssh -- -Y
or
vagrant ssh -- -Y node1
(the -- flag tells the vagrant command that the subsequent -Y flag should be passed to the ssh cmd that vagrant runs)

Only once ssh-ed with vagrant into the VM whose hostname is "node1" do you have access to node1's assigned IP: 10.211.55.101
- Machines connecting to node1, such as analytics, must either access node1 by that IP or use port forwarding to view the VM's servers on localhost. For example, on analytics you can then view the Yarn pages at http://localhost:8088/ (a port-forwarding sketch follows below).
- If firefox is launched inside the VM (so inside node1), then pages can be accessed off their respective ports at any of localhost|10.211.55.101|node1.
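
A sketch of the port-forwarding option just mentioned (hostnames are the ones used in these notes; 8088 is the Yarn ResourceManager port listed later in this file, and is only an assumption for other services):

# on your own machine:
ssh -Y analytics
# then on analytics, from the vagrant project directory, tunnel a local port to node1's Yarn UI:
vagrant ssh node1 -- -Y -L 8088:localhost:8088
# while that session is open, http://localhost:8088/ on analytics shows node1's Yarn pages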
16
17===========================================
18 WARC TO WET
19===========================================
20https://groups.google.com/forum/#!topic/common-crawl/hsb90GHq6to
21
22Sebastian Nagel
2305/07/2017
24Hi,
25
26unfortunately, we do not provide (yet) text extracts of the CC-NEWS archives.
27
28But it's easy to run the WET extractor on the WARC files, see:
29 https://groups.google.com/d/topic/common-crawl/imv4hlLob4s/discussion
30 https://groups.google.com/d/topic/common-crawl/b6yDG7EmnhM/discussion
31
32That's what you have to do:
33
34# download the WARC files and place them in a directory "warc/"
35# create sibling folders wat and wet
36# |
37# |-- warc/
38# | |-- CC-NEWS-20161001224340-00008.warc.gz
39# | |-- CC-NEWS-20161017145313-00000.warc.gz
40# | `-- ...
41# |
42# |-- wat/
43# |
44# `-- wet/
45
46git clone https://github.com/commoncrawl/ia-web-commons
47cd ia-web-commons
48mvn install
49
50cd ..
51git clone https://github.com/commoncrawl/ia-hadoop-tools
52cd ia-hadoop-tools
53mvn package
54
55java -jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator \
56 -strictMode -skipExisting batch-id-xyz .../warc/*.warc.gz
57
58The folders wat/ and wet/ will then contain the exports.
59
60Best,
61Sebastian
62
63---
64
1. So, following the above instructions, I first made a warc subfolder in hdfs:///user/vagrant/cc-mri-subset/
Then I moved all the downloaded *.warc.gz files into it.
Then I created wat and wet subfolders in there, alongside the warc folder.

2. Next, I did the 2 git clone and mvn compile operations above.
The first, ia-web-commons, compiled successfully (despite some test failures).

3. BUT THE 2ND GIT PROJECT, ia-hadoop-tools, DIDN'T COMPILE AT FIRST, with the mvn package command failing:
73
74git clone https://github.com/commoncrawl/ia-hadoop-tools
75cd ia-hadoop-tools
76mvn package
77
The compile failed with a message about the JSONTokener constructor not taking a String object. It turned out that the JSONTokener being used was too old a version of the class, whereas the necessary constructor is present in the most recent version, as seen in the API at https://docs.oracle.com/cd/E51273_03/ToolsAndFrameworks.110/apidoc/assembler/com/endeca/serialization/json/JSONTokener.html

So instead, I opened up ia-hadoop-tools/pom.xml for editing and added the newest version of the org.json package's json artifact (see http://builds.archive.org/maven2/org/json/json/ for the <version>) into the pom.xml's <dependencies> element, based on how this was done at https://bukkit.org/threads/problem-loading-libraries-with-maven-java-lang-noclassdeffounderror-org-json-jsonobject.428354/:
81
82 <dependency>
83 <groupId>org.json</groupId>
84 <artifactId>json</artifactId>
85 <version>20131018</version>
86 </dependency>
87
Then I was able to run "mvn package" successfully.
(Maybe I could also have added a far more recent version, as seen in the version numbers at https://mvnrepository.com/artifact/org.json/json,
but I didn't want to go too far ahead in case there were other incompatibilities.)
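
A quick way to sanity-check that the added org.json dependency is actually the one being resolved (plain Maven commands, nothing project-specific assumed):

cd ia-hadoop-tools
mvn dependency:tree | grep -B1 -A1 "org.json"
mvn package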
91
4. Next, I finally wanted to run the built executable to convert the warc files to wet files.

I had the warc files on the hadoop filesystem. However, the original instructions at https://groups.google.com/forum/#!topic/common-crawl/hsb90GHq6to were apparently for working with warcs stored on the local filesystem, as those instructions ran the regular java command rather than the hadoop command. The regular java command did not work with the files being on the hadoop filesystem (attempt #1 below).
95
96ATTEMPTS THAT DIDN'T WORK:
971. vagrant@node1:~/ia-hadoop-tools$ java -jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -strictMode -skipExisting batch-id-xyz hdfs:///user/vagrant/cc-mri-subset/warc/*.warc.gz
982. vagrant@node1:~/ia-hadoop-tools$ $HADOOP_MAPRED_HOME/bin/hadoop jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -strictMode -skipExisting batch-id-xyz hdfs:///user/vagrant/cc-mri-subset/warc/*.warc.gz
99
100
I based the 2nd attempt, which uses a proper hadoop command, off https://www.tutorialspoint.com/map_reduce/implementation_in_hadoop.htm
It produced lots of errors, and the output wet (and wat) .gz files were all corrupt, as gunzip could not successfully decompress them:
103
104vagrant@node1:~/ia-hadoop-tools$ $HADOOP_MAPRED_HOME/bin/hadoop jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -strictMode -skipExisting batch-id-xyz hdfs:///user/vagrant/cc-mri-subset/warc/*.warc.gz
10519/09/05 05:57:22 INFO Configuration.deprecation: mapred.task.timeout is deprecated. Instead, use mapreduce.task.timeout
10619/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902100139-000000.warc.gz
10719/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902100141-000001.warc.gz
10819/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902100451-000002.warc.gz
10919/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902100453-000003.warc.gz
11019/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902100805-000004.warc.gz
11119/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902100809-000005.warc.gz
11219/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902101119-000006.warc.gz
11319/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902101119-000007.warc.gz
11419/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902101429-000008.warc.gz
11519/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902101429-000009.warc.gz
11619/09/05 05:57:23 INFO client.RMProxy: Connecting to ResourceManager at node1/127.0.0.1:8032
11719/09/05 05:57:23 INFO client.RMProxy: Connecting to ResourceManager at node1/127.0.0.1:8032
11819/09/05 05:57:23 INFO mapred.FileInputFormat: Total input paths to process : 10
11919/09/05 05:57:24 INFO mapreduce.JobSubmitter: number of splits:10
12019/09/05 05:57:24 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1567397114047_0001
12119/09/05 05:57:24 INFO impl.YarnClientImpl: Submitted application application_1567397114047_0001
12219/09/05 05:57:24 INFO mapreduce.Job: The url to track the job: http://node1:8088/proxy/application_1567397114047_0001/
12319/09/05 05:57:24 INFO mapreduce.Job: Running job: job_1567397114047_0001
12419/09/05 05:57:31 INFO mapreduce.Job: Job job_1567397114047_0001 running in uber mode : false
12519/09/05 05:57:31 INFO mapreduce.Job: map 0% reduce 0%
12619/09/05 05:57:44 INFO mapreduce.Job: map 10% reduce 0%
12719/09/05 05:57:44 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000002_0, Status : FAILED
128Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
129Container killed by the ApplicationMaster.
130Container killed on request. Exit code is 143
131Container exited with a non-zero exit code 143
132
13319/09/05 05:57:45 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000004_0, Status : FAILED
134Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
135Container killed by the ApplicationMaster.
136Container killed on request. Exit code is 143
137Container exited with a non-zero exit code 143
138
13919/09/05 05:57:45 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000001_0, Status : FAILED
140Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
14119/09/05 05:57:45 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000005_0, Status : FAILED
142Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
14319/09/05 05:57:45 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000000_0, Status : FAILED
144Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
14519/09/05 05:57:45 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000003_0, Status : FAILED
146Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
14719/09/05 05:57:46 INFO mapreduce.Job: map 0% reduce 0%
14819/09/05 05:57:54 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000007_0, Status : FAILED
149Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
15019/09/05 05:57:55 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000006_0, Status : FAILED
151Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
15219/09/05 05:57:57 INFO mapreduce.Job: map 10% reduce 0%
15319/09/05 05:57:57 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000009_0, Status : FAILED
154Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
155Container killed by the ApplicationMaster.
156Container killed on request. Exit code is 143
157Container exited with a non-zero exit code 143
158
15919/09/05 05:57:58 INFO mapreduce.Job: map 20% reduce 0%
16019/09/05 05:57:58 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000008_0, Status : FAILED
161Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
16219/09/05 05:58:06 INFO mapreduce.Job: map 30% reduce 0%
16319/09/05 05:58:08 INFO mapreduce.Job: map 60% reduce 0%
16419/09/05 05:58:09 INFO mapreduce.Job: map 70% reduce 0%
16519/09/05 05:58:10 INFO mapreduce.Job: map 80% reduce 0%
16619/09/05 05:58:12 INFO mapreduce.Job: map 90% reduce 0%
16719/09/05 05:58:13 INFO mapreduce.Job: map 100% reduce 0%
16819/09/05 05:58:13 INFO mapreduce.Job: Job job_1567397114047_0001 completed successfully
16919/09/05 05:58:13 INFO mapreduce.Job: Counters: 32
170 File System Counters
171 FILE: Number of bytes read=0
172 FILE: Number of bytes written=1239360
173 FILE: Number of read operations=0
174 FILE: Number of large read operations=0
175 FILE: Number of write operations=0
176 HDFS: Number of bytes read=1430
177 HDFS: Number of bytes written=0
178 HDFS: Number of read operations=30
179 HDFS: Number of large read operations=0
180 HDFS: Number of write operations=0
181 Job Counters
182 Failed map tasks=10
183 Launched map tasks=20
184 Other local map tasks=10
185 Data-local map tasks=10
186 Total time spent by all maps in occupied slots (ms)=208160
187 Total time spent by all reduces in occupied slots (ms)=0
188 Total time spent by all map tasks (ms)=208160
189 Total vcore-milliseconds taken by all map tasks=208160
190 Total megabyte-milliseconds taken by all map tasks=213155840
191 Map-Reduce Framework
192 Map input records=10
193 Map output records=0
194 Input split bytes=1430
195 Spilled Records=0
196 Failed Shuffles=0
197 Merged Map outputs=0
198 GC time elapsed (ms)=1461
199 CPU time spent (ms)=2490
200 Physical memory (bytes) snapshot=1564528640
201 Virtual memory (bytes) snapshot=19642507264
202 Total committed heap usage (bytes)=1126170624
203 File Input Format Counters
204 Bytes Read=0
205 File Output Format Counters
206 Bytes Written=0
207vagrant@node1:~/ia-hadoop-tools$
208
209
5. The error messages are all the same, but not very informative:
    19/09/05 05:57:45 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000001_0, Status : FAILED
    Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;

All the references I could find on Google indicated that the full version of this error message is that the method (com.google.common.io.ByteStreams.limit(...)) could not be located.
The page at http://mail-archives.apache.org/mod_mbox/spark-user/201510.mbox/%[email protected]%3E
revealed that guava.jar contains the com.google.common.io.ByteStreams class.
217
218
TO GET THE EXECUTABLE TO WORK:
I located guava.jar and found there were 2 identical copies on the filesystem, but neither was on the hadoop classpath yet, so I copied one into an existing Hadoop classpath location. Then I was able to successfully run the executable and at last produce meaningful WET files from the WARC input files:
221
222
223vagrant@node1:~$ locate guava.jar
224/usr/share/java/guava.jar
225/usr/share/maven/lib/guava.jar
226vagrant@node1:~$ jar -tvf /usr/share/maven/lib/guava.jar | less
227vagrant@node1:~$ jar -tvf /usr/share/java/guava.jar | less
228# both contained the ByteStreams class
229
230vagrant@node1:~$ cd -
231/home/vagrant/ia-hadoop-tools
232vagrant@node1:~/ia-hadoop-tools$ find . -name "guava.jar"
233# None in the git project
234
235vagrant@node1:~/ia-hadoop-tools$ hadoop classpath
236/usr/local/hadoop/etc/hadoop:/usr/local/hadoop/share/hadoop/common/lib/*:/usr/local/hadoop/share/hadoop/common/*:/usr/local/hadoop/share/hadoop/hdfs:/usr/local/hadoop/share/hadoop/hdfs/lib/*:/usr/local/hadoop/share/hadoop/hdfs/*:/usr/local/hadoop/share/hadoop/yarn/lib/*:/usr/local/hadoop/share/hadoop/yarn/*:/usr/local/hadoop/share/hadoop/mapreduce/lib/*:/usr/local/hadoop/share/hadoop/mapreduce/*:/contrib/capacity-scheduler/*.jar
237# guava.jar not on hadoop classpath yet
238
239vagrant@node1:~/ia-hadoop-tools$ diff /usr/share/java/guava.jar /usr/share/maven/lib/guava.jar
240# no differences, identical
241
242vagrant@node1:~/ia-hadoop-tools$ hdfs dfs -put /usr/share/java/guava.jar /usr/local/hadoop/share/hadoop/common/.
243put: `/usr/local/hadoop/share/hadoop/common/.': No such file or directory
244# hadoop classpath locations are not on the hdfs filesystem, but on the regular fs
245
246vagrant@node1:~/ia-hadoop-tools$ sudo cp /usr/share/java/guava.jar /usr/local/hadoop/share/hadoop/common/.
247vagrant@node1:~/ia-hadoop-tools$ hadoop classpath
248/usr/local/hadoop/etc/hadoop:/usr/local/hadoop/share/hadoop/common/lib/*:/usr/local/hadoop/share/hadoop/common/*:/usr/local/hadoop/share/hadoop/hdfs:/usr/local/hadoop/share/hadoop/hdfs/lib/*:/usr/local/hadoop/share/hadoop/hdfs/*:/usr/local/hadoop/share/hadoop/yarn/lib/*:/usr/local/hadoop/share/hadoop/yarn/*:/usr/local/hadoop/share/hadoop/mapreduce/lib/*:/usr/local/hadoop/share/hadoop/mapreduce/*:/contrib/capacity-scheduler/*.jar
249# Copied guava.jar to somewhere on existing hadoop classpath
250
251vagrant@node1:~/ia-hadoop-tools$ $HADOOP_MAPRED_HOME/bin/hadoop jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -strictMode -skipExisting batch-id-xyz hdfs:///user/vagrant/cc-mri-subset/warc/*.warc.gz
252# Successful run
253
254vagrant@node1:~$ hdfs dfs -get /user/vagrant/cc-mri-subset/wet/MAORI-CC-2019-30-20190902100139-000000.warc.wet.gz /home/vagrant/.
255vagrant@node1:~$ cd ..
256vagrant@node1:~$ gunzip MAORI-CC-2019-30-20190902100139-000000.warc.wet.gz
257vagrant@node1:~$ less MAORI-CC-2019-30-20190902100139-000000.warc.wet
258# Copied a WET output file from the hadoop filesystem to local filesystem and inspected its contents. Works!
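
A streaming spot-check along the same lines, without copying the whole file out of HDFS first (paths and filename as in the run above):

hdfs dfs -ls /user/vagrant/cc-mri-subset/wet/
hdfs dfs -cat /user/vagrant/cc-mri-subset/wet/MAORI-CC-2019-30-20190902100139-000000.warc.wet.gz | gunzip | head -20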
259
260-----------------------------------
261 VIEW THE MRI-ONLY INDEX GENERATED
262-----------------------------------
hdfs dfs -cat hdfs:///user/vagrant/cc-mri-csv/part* | tail -5

(This only shows binary content, since the part* files form a gz archive, not plain text.)
266
267vagrant@node1:~/cc-index-table/src/script$ hdfs dfs -mkdir hdfs:///user/vagrant/cc-mri-unzipped-csv
268
269# https://stackoverflow.com/questions/34573279/how-to-unzip-gz-files-in-a-new-directory-in-hadoop
270XXX vagrant@node1:~/cc-index-table/src/script$ hadoop fs -cat hdfs:///user/vagrant/cc-mri-csv/part* | gzip -d | hadoop fs -put - hdfs:///user/vagrant/cc-mri-unzipped-csv
271
272
273vagrant@node1:~/cc-index-table/src/script$ hdfs dfs -cat hdfs:///user/vagrant/cc-mri-csv/part* | gzip -d | hdfs dfs -put - hdfs:///user/vagrant/cc-mri-unzipped-csv/cc-mri.csv
274vagrant@node1:~/cc-index-table/src/script$ hdfs dfs -ls hdfs:///user/vagrant/cc-mri-unzipped-csv
275Found 1 items
276-rw-r--r-- 1 vagrant supergroup 71664603 2019-08-29 04:47 hdfs:///user/vagrant/cc-mri-unzipped-csv/cc-mri.csv
277
278# https://stackoverflow.com/questions/14925323/view-contents-of-file-in-hdfs-hadoop
279vagrant@node1:~/cc-index-table/src/script$ hdfs dfs -cat hdfs:///user/vagrant/cc-mri-unzipped-csv/cc-mri.csv | tail -5
280
281# url, warc_filename, warc_record_offset, warc_record_length
282http://paupauocean.com/page91?product_id=142&brd=1,crawl-data/CC-MAIN-2019-30/segments/1563195526940.0/warc/CC-MAIN-20190721082354-20190721104354-00088.warc.gz,115081770,21404
283https://cookinseln-reisen.de/cook-inseln/rarotonga/,crawl-data/CC-MAIN-2019-30/segments/1563195526799.4/warc/CC-MAIN-20190720235054-20190721021054-00289.warc.gz,343512295,12444
284http://www.halopharm.com/mi/profile/,crawl-data/CC-MAIN-2019-30/segments/1563195525500.21/warc/CC-MAIN-20190718042531-20190718064531-00093.warc.gz,219160333,10311
285https://www.firstpeople.us/pictures/green/Touched-by-the-hand-of-Time-1907.html,crawl-data/CC-MAIN-2019-30/segments/1563195526670.1/warc/CC-MAIN-20190720194009-20190720220009-00362.warc.gz,696195242,5408
286https://www.sos-accessoire.com/programmateur-programmateur-module-electronique-whirlpool-481231028062-27573.html,crawl-data/CC-MAIN-2019-30/segments/1563195527048.80/warc/CC-MAIN-20190721144008-20190721170008-00164.warc.gz,830087190,26321
287
288# https://stackoverflow.com/questions/32612867/how-to-count-lines-in-a-file-on-hdfs-command
289vagrant@node1:~/cc-index-table/src/script$ hdfs dfs -cat hdfs:///user/vagrant/cc-mri-unzipped-csv/cc-mri.csv | wc -l
290345625
291
292
293ANOTHER WAY (DR BAINBRIDGE'S WAY) TO CREATE SINGLE .CSV FILE FROM /part* FILES AND VIEW ITS CONTENTS:
294vagrant@node1:~/cc-index-table$ hdfs dfs -cat hdfs:///user/vagrant/cc-mri-csv/part* > file.csv.gz
295vagrant@node1:~/cc-index-table$ less file.csv.gz
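
Since concatenated gzip members form a single valid gzip stream, the combined file can also be inspected without an explicit gunzip step (assuming zless/zcat are installed in the VM):

zless file.csv.gz
zcat file.csv.gz | wc -l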
296
297
298https://www.patricia-anong.com/blog/2017/11/1/extend-vmdk-on-virtualbox
299
300
When using content_languages = 'mri' in the query instead of LIKE '%mri%':
302vagrant@node1:~/cc-index-table$ hdfs dfs -cat hdfs:///user/vagrant/cc-mri-unzipped-csv/cc-mri.csv | wc -l
3035767
304
305
For the crawl of a month later, the August 2019 crawl (CC-MAIN-2019-35):
307vagrant@node1:~$ hdfs dfs -cat hdfs:///user/vagrant/CC-MAIN-2019-35/cc-mri-unzipped-csv/cc-mri.csv | wc -l
3089318
309
310-----------------------------------------
311Running export_mri_subset.sh
312-----------------------------------------
313
The export_mri_subset.sh script is set up to run on the csv input file produced by running export_mri_index_csv.sh
315
316Running this initially produced the following exception:
317
318
3192019-08-29 05:48:52 INFO CCIndexExport:152 - Number of records/rows matched by query: 345624
3202019-08-29 05:48:52 INFO CCIndexExport:157 - Distributing 345624 records to 70 output partitions (max. 5000 records per WARC file)
3212019-08-29 05:48:52 INFO CCIndexExport:165 - Repartitioning data to 70 output partitions
322Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`url`' given input columns: [http://176.31.110.213:600/?p=287, crawl-data/CC-MAIN-2019-30/segments/1563195527531.84/warc/CC-MAIN-20190722051628-20190722073628-00547.warc.gz, 1215489, 15675];;
323'Project ['url, 'warc_filename, 'warc_record_offset, 'warc_record_length]
324+- AnalysisBarrier
325 +- Repartition 70, true
326 +- Relation[http://176.31.110.213:600/?p=287#10,crawl-data/CC-MAIN-2019-30/segments/1563195527531.84/warc/CC-MAIN-20190722051628-20190722073628-00547.warc.gz#11,1215489#12,15675#13] csv
327
328 at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
329 at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:88)
330 at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85)
331 at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
332 at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
333 at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
334 at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
335 at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95)
336 at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95)
337 at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:106)
338 at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:116)
339 at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:120)
340 at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
341 at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
342 at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
343 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
344 at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
345 at scala.collection.AbstractTraversable.map(Traversable.scala:104)
346 at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:120)
347 at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:125)
348 at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
349 at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:125)
350 at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:95)
351 at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:85)
352 at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80)
353 at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
354 at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80)
355 at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
356 at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:104)
357 at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
358 at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
359 at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
360 at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74)
361 at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:3295)
362 at org.apache.spark.sql.Dataset.select(Dataset.scala:1307)
363 at org.apache.spark.sql.Dataset.select(Dataset.scala:1325)
364 at org.apache.spark.sql.Dataset.select(Dataset.scala:1325)
365 at org.commoncrawl.spark.examples.CCIndexWarcExport.run(CCIndexWarcExport.java:169)
366 at org.commoncrawl.spark.examples.CCIndexExport.run(CCIndexExport.java:192)
367 at org.commoncrawl.spark.examples.CCIndexWarcExport.main(CCIndexWarcExport.java:214)
368 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
369 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
370 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
371 at java.lang.reflect.Method.invoke(Method.java:498)
372 at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
373 at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
374 at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
375 at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
376 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
377 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
3782019-08-29 05:48:52 INFO SparkContext:54 - Invoking stop() from shutdown hook
379
380
381
Hints to solve it were at https://stackoverflow.com/questions/45972929/scala-dataframereader-keep-column-headers
The actual solution is to edit CCIndexWarcExport.java as follows:
1. Set option("header") to false, since the csv file contains no header row, only data rows. You can confirm the csv has no header row by doing
    hdfs dfs -cat hdfs:///user/vagrant/cc-mri-csv/part* | head -5

2. With no header, the 4 column names are inferred as _c0 to _c3, not as url/warc_filename etc., so select those instead.
388
389emacs src/main/java/org/commoncrawl/spark/examples/CCIndexWarcExport.java
390
391Change:
392 sqlDF = sparkSession.read().format("csv").option("header", true).option("inferSchema", true)
393 .load(csvQueryResult);
394To
395 sqlDF = sparkSession.read().format("csv").option("header", false).option("inferSchema", true)
396 .load(csvQueryResult);
397
And comment out:
    //JavaRDD<Row> rdd = sqlDF.select("url", "warc_filename", "warc_record_offset", "warc_record_length").rdd()
    //        .toJavaRDD();
Replacing it with the same select on the default inferred column names:
    JavaRDD<Row> rdd = sqlDF.select("_c0", "_c1", "_c2", "_c3").rdd()
            .toJavaRDD();
404
405
406Now recompile:
407 mvn package
408
409And run:
410 ./src/script/export_mri_subset.sh
411
412-------------------------
413
414WET example from https://github.com/commoncrawl/cc-warc-examples
415
416vagrant@node1:~/cc-warc-examples$ hdfs dfs -mkdir /user/vagrant/data
417vagrant@node1:~/cc-warc-examples$ hdfs dfs -put data/CC-MAIN-20190715175205-20190715200159-00000.warc.wet.gz hdfs:///user/vagrant/data/.
418vagrant@node1:~/cc-warc-examples$ hdfs dfs -ls data
419Found 1 items
420-rw-r--r-- 1 vagrant supergroup 154823265 2019-08-19 08:23 data/CC-MAIN-20190715175205-20190715200159-00000.warc.wet.gz
421vagrant@node1:~/cc-warc-examples$ hadoop jar target/cc-warc-examples-0.3-SNAPSHOT-jar-with-dependencies.jar org.commoncrawl.examples.mapreduce.WETWordCount
422
423<ONCE FINISHED:>
424
425vagrant@node1:~/cc-warc-examples$ hdfs dfs -cat /tmp/cc/part*
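
To pull the word-count output out of HDFS in one go, a sketch using the standard getmerge command (output path /tmp/cc as above; the local filename is made up):

hdfs dfs -getmerge /tmp/cc /home/vagrant/wet-wordcount.txt
less /home/vagrant/wet-wordcount.txt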
426
427
428
429INFO ON HADOOP/HDFS:
430https://www.bluedata.com/blog/2016/08/hadoop-hdfs-upgrades-painful/
431
432SPARK:
433configure option example: https://stackoverflow.com/questions/46152526/how-should-i-configure-spark-to-correctly-prune-hive-metastore-partitions
434
435
436
437LIKE '%isl%'
438
439cd cc-index-table
440APPJAR=target/cc-spark-0.2-SNAPSHOT-jar-with-dependencies.jar
# (any $SPARK_ON_YARN options would go immediately after spark-submit when running on YARN)
$SPARK_HOME/bin/spark-submit \
443 --conf spark.hadoop.parquet.enable.dictionary=true \
444 --conf spark.hadoop.parquet.enable.summary-metadata=false \
445 --conf spark.sql.hive.metastorePartitionPruning=true \
446 --conf spark.sql.parquet.filterPushdown=true \
447 --conf spark.sql.parquet.mergeSchema=true \
448 --class org.commoncrawl.spark.examples.CCIndexWarcExport $APPJAR \
449 --query "SELECT url, warc_filename, warc_record_offset, warc_record_length
450 FROM ccindex
451 WHERE crawl = 'CC-MAIN-2019-30' AND subset = 'warc' AND content_languages LIKE '%mri%'" \
452 --numOutputPartitions 12 \
453 --numRecordsPerWarcFile 20000 \
454 --warcPrefix ICELANDIC-CC-2018-43 \
455 s3://commoncrawl/cc-index/table/cc-main/warc/ \
456 .../my_output_path/
457
458
459----
460TIME
461----
4621. https://dzone.com/articles/need-billions-of-web-pages-dont-bother-crawling
463http://digitalpebble.blogspot.com/2017/03/need-billions-of-web-pages-dont-bother_29.html
464
465"So, not only have CommonCrawl given you loads of web data for free, they’ve also made your life easier by preprocessing the data for you. For many tasks, the content of the WAT or WET files will be sufficient and you won’t have to process the WARC files.
466
467This should not only help you simplify your code but also make the whole processing faster. We recently ran an experiment on CommonCrawl where we needed to extract anchor text from HTML pages. We initially wrote some MapReduce code to extract the binary content of the pages from their WARC representation, processed the HTML with JSoup and reduced on the anchor text. Processing a single WARC segment took roughly 100 minutes on a 10-node EMR cluster. We then simplified the extraction logic, took the WAT files as input and the processing time dropped to 17 minutes on the same cluster. This gain was partly due to not having to parse the web pages, but also to the fact that WAT files are a lot smaller than their WARC counterparts."
468
4692. https://spark-in.me/post/parsing-common-crawl-in-two-simple-commands
470"Both the parsing part and the processing part take just a couple of minutes per index file / WET file - the bulk of the “compute” lies within actually downloading these files.
471
472Essentially if you have some time to spare and an unlimited Internet connection, all of this processing can be done on one powerful machine. You can be fancy and go ahead and rent some Amazon server(s) to minimize the download time, but that can be costly.
473
474In my experience - parsing the whole index for Russian websites (just filtering by language) takes approximately 140 hours - but the majority of this time is just downloading (my speed averaged ~300-500 kb/s)."
475
476----
477CMDS
478----
479https://stackoverflow.com/questions/29565716/spark-kill-running-application
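
From that link, when the job is running on YARN the usual approach is roughly the following (the application id here is just the example one from the log earlier in this file):

yarn application -list                                   # find the running application's id
yarn application -kill application_1567397114047_0001    # then kill it by id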
480
481=========================================================
482Configuring spark to work on Amazon AWS s3a dataset:
483=========================================================
484https://stackoverflow.com/questions/30385981/how-to-access-s3a-files-from-apache-spark
485http://deploymentzone.com/2015/12/20/s3a-on-spark-on-aws-ec2/
486https://answers.dataiku.com/1734/common-crawl-s3
487https://stackoverflow.com/questions/2354525/what-should-be-hadoop-tmp-dir
488https://stackoverflow.com/questions/40169610/where-exactly-should-hadoop-tmp-dir-be-set-core-site-xml-or-hdfs-site-xml
489
490https://stackoverflow.com/questions/43759896/spark-truncated-the-string-representation-of-a-plan-since-it-was-too-large-w
491
492
493https://sparkour.urizone.net/recipes/using-s3/
494Configuring Spark to Use Amazon S3
495"Some Spark tutorials show AWS access keys hardcoded into the file paths. This is a horribly insecure approach and should never be done. Use exported environment variables or IAM Roles instead, as described in Configuring Amazon S3 as a Spark Data Source."
496
497"No FileSystem for scheme: s3n
498
499java.io.IOException: No FileSystem for scheme: s3n
500
501This message appears when dependencies are missing from your Apache Spark distribution. If you see this error message, you can use the --packages parameter and Spark will use Maven to locate the missing dependencies and distribute them to the cluster. Alternately, you can use --jars if you manually downloaded the dependencies already. These parameters also works on the spark-submit script."
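
A hedged illustration of the --packages route mentioned in that quote, applied to the CCIndexExport job used elsewhere in these notes (the hadoop-aws version is an assumption chosen to match the Hadoop 2.7.x install in this VM; the remaining arguments are copied from the working command further below):

APPJAR=target/cc-spark-0.2-SNAPSHOT-jar-with-dependencies.jar
$SPARK_HOME/bin/spark-submit \
    --packages org.apache.hadoop:hadoop-aws:2.7.6 \
    --class org.commoncrawl.spark.examples.CCIndexExport $APPJAR \
    --query "SELECT url, warc_filename, warc_record_offset, warc_record_length
             FROM ccindex
             WHERE crawl = 'CC-MAIN-2019-30' AND subset = 'warc' AND content_languages LIKE '%mri%'" \
    --outputFormat csv \
    --numOutputPartitions 10 \
    --outputCompression gzip \
    s3://commoncrawl/cc-index/table/cc-main/warc/ \
    hdfs:///user/vagrant/cc-mri-csv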
502
503===========================================
504IAM Role (or user) and commoncrawl profile
505===========================================
506
507"iam" role or user for commoncrawl(er) profile
508
509
510aws management console:
511[email protected]
512lab pwd, capital R and ! (maybe g)
513
514commoncrawl profile created while creating the user/role, by following the instructions at: https://answers.dataiku.com/1734/common-crawl-s3
515
516<!--
517 <property>
518 <name>fs.s3a.awsAccessKeyId</name>
519 <value>XXX</value>
520 </property>
521 <property>
522 <name>fs.s3a.awsSecretAccessKey</name>
523 <value>XXX</value>
524 </property>
525-->
526
527
[If the access key and secret key were specified in hadoop's core-site.xml and not in the spark conf properties file, then running export_maori_index_csv.sh produced the following error:
529
5302019-08-29 06:16:38 INFO StateStoreCoordinatorRef:54 - Registered StateStoreCoordinator endpoint
5312019-08-29 06:16:40 WARN FileStreamSink:66 - Error while looking for metadata directory.
532Exception in thread "main" com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain
533 at com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
534 at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3521)
535 at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
536 at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
537 at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
538]
539
540Instead of putting the access and secret keys in hadoop's core-site.xml as above (with sudo emacs /usr/local/hadoop-2.7.6/etc/hadoop/core-site.xml)
541
542you'll want to put the Amazon AWS access key and secret key in the spark properties file:
543
544 sudo emacs /usr/local/spark-2.3.0-bin-hadoop2.7/conf/spark-defaults.conf
545
546
547The spark properties conf file above should contain:
548
549spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
550spark.hadoop.fs.s3a.access.key=PASTE_IAM-ROLE_ACCESSKEY
551spark.hadoop.fs.s3a.secret.key=PASTE_IAM-ROLE_SECRETKEY
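
One way to append those three lines (a sketch: the spark-defaults.conf path is the one given above, and the two key values are placeholders to be replaced with the real IAM keys):

sudo tee -a /usr/local/spark-2.3.0-bin-hadoop2.7/conf/spark-defaults.conf <<'EOF'
spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key=PASTE_IAM-ROLE_ACCESSKEY
spark.hadoop.fs.s3a.secret.key=PASTE_IAM-ROLE_SECRETKEY
EOF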
552
553
554
When the job is running, you can visit the Spark context web UI at http://node1:4040/jobs/ (for me it was http://node1:4041/jobs/ the first time, since I had forwarded the vagrant VM's ports at +1; on subsequent runs, however, it was at node1:4040/jobs).
556
557-------------
558
559APPJAR=target/cc-spark-0.2-SNAPSHOT-jar-with-dependencies.jar
560$SPARK_HOME/bin/spark-submit \
561 --conf spark.hadoop.parquet.enable.dictionary=true \
562 --conf spark.hadoop.parquet.enable.summary-metadata=false \
563 --conf spark.sql.hive.metastorePartitionPruning=true \
564 --conf spark.sql.parquet.filterPushdown=true \
565 --conf spark.sql.parquet.mergeSchema=true \
566 --class org.commoncrawl.spark.examples.CCIndexExport $APPJAR \
567 --query "SELECT url, warc_filename, warc_record_offset, warc_record_length
568 FROM ccindex
569 WHERE crawl = 'CC-MAIN-2019-30' AND subset = 'warc' AND content_languages LIKE '%mri%'" \
570 --outputFormat csv \
571 --numOutputPartitions 10 \
572 --outputCompression gzip \
573 s3://commoncrawl/cc-index/table/cc-main/warc/ \
574 hdfs:///user/vagrant/cc-mri-csv
575
576----------------
577Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
578
579
580https://stackoverflow.com/questions/39355354/spark-no-filesystem-for-scheme-https-cannot-load-files-from-amazon-s3
581https://stackoverflow.com/questions/33356041/technically-what-is-the-difference-between-s3n-s3a-and-s3
582"2018-01-10 Update Hadoop 3.0 has cut its s3: and s3n implementations: s3a is all you get. It is now significantly better than its predecessor and performs as least as good as the Amazon implementation. Amazon's "s3:" is still offered by EMR, which is their closed source client. Consult the EMR docs for more info."
583
5841. https://stackoverflow.com/questions/30385981/how-to-access-s3a-files-from-apache-spark
585
586"Having experienced first hand the difference between s3a and s3n - 7.9GB of data transferred on s3a was around ~7 minutes while 7.9GB of data on s3n took 73 minutes [us-east-1 to us-west-1 unfortunately in both cases; Redshift and Lambda being us-east-1 at this time] this is a very important piece of the stack to get correct and it's worth the frustration.
587
588Here are the key parts, as of December 2015:
589
590 Your Spark cluster will need a Hadoop version 2.x or greater. If you use the Spark EC2 setup scripts and maybe missed it, the switch for using something other than 1.0 is to specify --hadoop-major-version 2 (which uses CDH 4.2 as of this writing).
591
592 You'll need to include what may at first seem to be an out of date AWS SDK library (built in 2014 as version 1.7.4) for versions of Hadoop as late as 2.7.1 (stable): aws-java-sdk 1.7.4. As far as I can tell using this along with the specific AWS SDK JARs for 1.10.8 hasn't broken anything.
593
594 You'll also need the hadoop-aws 2.7.1 JAR on the classpath. This JAR contains the class org.apache.hadoop.fs.s3a.S3AFileSystem.
595
596 In spark.properties you probably want some settings that look like this:
597
598 spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
599 spark.hadoop.fs.s3a.access.key=ACCESSKEY
600 spark.hadoop.fs.s3a.secret.key=SECRETKEY
601
602I've detailed this list in more detail on a post I wrote as I worked my way through this process. In addition I've covered all the exception cases I hit along the way and what I believe to be the cause of each and how to fix them."
603
604
6052. The classpath used by hadoop can be found by running the command (https://stackoverflow.com/questions/26748811/setting-external-jars-to-hadoop-classpath):
606 hadoop classpath
607
608
6093. Got hadoop-aws 2.7.6 jar
610from https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/2.7.6
611and put it into /home/vagrant
612
613
6144. https://stackoverflow.com/questions/26748811/setting-external-jars-to-hadoop-classpath
615https://stackoverflow.com/questions/28520821/how-to-add-external-jar-to-hadoop-job/54459211#54459211
616vagrant@node1:~$ export LIBJARS=/home/vagrant/hadoop-aws-2.7.6.jar
617vagrant@node1:~$ export HADOOP_CLASSPATH=`echo ${LIBJARS} | sed s/,/:/g`
618vagrant@node1:~$ hadoop classpath
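
To confirm the exported jar actually shows up (entries from HADOOP_CLASSPATH are appended verbatim to the reported classpath):

hadoop classpath | tr ':' '\n' | grep -i 'hadoop-aws'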
619
6205. https://community.cloudera.com/t5/Community-Articles/HDP-2-4-0-and-Spark-1-6-0-connecting-to-AWS-S3-buckets/ta-p/245760
621"Download the aws sdk for java https://aws.amazon.com/sdk-for-java/ Uploaded it to the hadoop directory. You should see the aws-java-sdk-1.10.65.jar in /usr/hdp/2.4.0.0-169/hadoop/"
622
623I got version 1.11
624
625[Can't find a spark.properties file, but this seems to contain spark specific properties:
626$SPARK_HOME/conf/spark-defaults.conf
627
628https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-properties.html
629"The default Spark properties file is $SPARK_HOME/conf/spark-defaults.conf that could be overriden using spark-submit with the --properties-file command-line option."]
630
Can sudo cp the 2 jar files, hadoop-aws-2.7.6.jar and aws-java-sdk-1.11.616.jar, to:
/usr/local/hadoop/share/hadoop/common/
(or else to /usr/local/hadoop/share/hadoop/hdfs/, i.e. /usr/local/hadoop/share/hadoop/hdfs/hadoop-aws-2.7.6.jar), as sketched below.
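
The copy described above, as commands (the aws-java-sdk jar is assumed to have been downloaded to /home/vagrant alongside hadoop-aws-2.7.6.jar; adjust the source paths if it lives elsewhere):

sudo cp /home/vagrant/hadoop-aws-2.7.6.jar /usr/local/hadoop/share/hadoop/common/
sudo cp /home/vagrant/aws-java-sdk-1.11.616.jar /usr/local/hadoop/share/hadoop/common/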
634
635--------
636schema
637https://commoncrawl.s3.amazonaws.com/cc-index/table/cc-main/index.html
638
639---------------
640More examples to try:
641https://github.com/commoncrawl/cc-warc-examples
642
643
644A bit outdated?
645https://www.journaldev.com/20342/apache-spark-example-word-count-program-java
646https://www.journaldev.com/20261/apache-spark
647
648--------
649
650sudo apt-get install maven
651(or sudo apt update
652sudo apt install maven)
653git clone https://github.com/commoncrawl/cc-index-table.git
654cd cc-index-table
655mvn package
656vagrant@node1:~/cc-index-table$ ./src/script/convert_url_index.sh https://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2019-30/indexes/cdx-00000.gz hdfs:///user/vagrant/cc-index-table
657
658
659
660
661spark:
662https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-shell.html
663
664============
Dr Bainbridge found the following Vagrant project, which will set up hadoop and spark, presumably for cluster computing:
666
667https://github.com/martinprobson/vagrant-hadoop-hive-spark
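
A minimal workflow sketch for that repo (just the standard vagrant commands; see the links below for details):

git clone https://github.com/martinprobson/vagrant-hadoop-hive-spark
cd vagrant-hadoop-hive-spark
vagrant up            # provision and boot the VM (node1)
vagrant ssh node1     # log in; add -- -Y for X forwarding as described earlier
vagrant halt          # shut the VM down when finished
vagrant reload        # = vagrant halt + vagrant up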
668
669Vagrant:
670 * Guide: https://www.vagrantup.com/intro/getting-started/index.html
671 * Common cmds: https://blog.ipswitch.com/5-vagrant-commands-you-need-to-know
672 * vagrant reload = vagrant halt + vagrant up https://www.vagrantup.com/docs/cli/reload.html
673 * https://stackoverflow.com/questions/46903623/how-to-use-firefox-ui-in-vagrant-box
674 * https://stackoverflow.com/questions/22651399/how-to-install-firefox-in-precise64-vagrant-box
675 sudo apt-get -y install firefox
676 * vagrant install emacs: https://medium.com/@AnnaJS15/getting-started-with-virtualbox-and-vagrant-8d98aa271d2a
677
678 * hadoop conf: sudo vi /usr/local/hadoop-2.7.6/etc/hadoop/mapred-site.xml
679 * https://data-flair.training/forums/topic/mkdir-cannot-create-directory-data-name-node-is-in-safe-mode/
680---
681==> node1: Forwarding ports...
682 node1: 8080 (guest) => 8081 (host) (adapter 1)
683 node1: 8088 (guest) => 8089 (host) (adapter 1)
684 node1: 9083 (guest) => 9084 (host) (adapter 1)
685 node1: 4040 (guest) => 4041 (host) (adapter 1)
686 node1: 18888 (guest) => 18889 (host) (adapter 1)
687 node1: 16010 (guest) => 16011 (host) (adapter 1)
688 node1: 22 (guest) => 2200 (host) (adapter 1)
689==> node1: Running 'pre-boot' VM customizations...
690
691
692==> node1: Checking for guest additions in VM...
693 node1: The guest additions on this VM do not match the installed version of
694 node1: VirtualBox! In most cases this is fine, but in rare cases it can
695 node1: prevent things such as shared folders from working properly. If you see
696 node1: shared folder errors, please make sure the guest additions within the
697 node1: virtual machine match the version of VirtualBox you have installed on
698 node1: your host and reload your VM.
699 node1:
700 node1: Guest Additions Version: 5.1.38
701 node1: VirtualBox Version: 5.2
702
703------------