Hadoop/Map-reduce

https://www.guru99.com/create-your-first-hadoop-program.html

Some Hadoop commands
* https://community.cloudera.com/t5/Support-Questions/Closed-How-to-store-output-of-shell-script-in-HDFS/td-p/229933
* https://stackoverflow.com/questions/26513861/checking-if-directory-in-hdfs-already-exists-or-not
--------------
To run firefox/anything graphical inside the VM run by vagrant, you have to ssh -Y first onto analytics and then from analytics onto the vagrant VM:
1. ssh analytics -Y
2. [anupama@analytics vagrant-hadoop-hive-spark]$ vagrant ssh -- -Y
or
vagrant ssh -- -Y node1
(the -- separator tells the vagrant command that the subsequent -Y flag should be passed to the ssh cmd that vagrant runs)

Only once ssh-ed with vagrant into the VM whose hostname is "node1" do you have access to node1's assigned IP: 10.211.55.101
- Connecting machines, like analytics, must either access node1 or use port forwarding to view the VM's servers on localhost. For example, on analytics, you can view the Yarn pages at http://localhost:8088/
- If firefox is launched inside the VM (so inside node1), then you can access pages off their respective ports at any of localhost|10.211.55.101|node1.
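
If the Yarn pages need to be viewed from yet another machine that can only reach analytics over ssh, standard ssh local port forwarding (-L) is one option. A minimal sketch, not something set up in these notes: the helper name forward_args is made up, and it assumes the service is reachable on analytics at localhost:<port>, as described for port 8088 above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build one "-L port:localhost:port" argument per port given.
forward_args() {
  local args="" port
  for port in "$@"; do
    args="$args -L ${port}:localhost:${port}"
  done
  # Trim the leading space before printing.
  echo "${args# }"
}

# Print (rather than run) the resulting command so it can be inspected first.
echo "ssh $(forward_args 8088) analytics"
```

Running the printed command on the other machine should then make http://localhost:8088/ there show the same pages for as long as the ssh session stays open.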

===========================================
WARC TO WET
===========================================
https://groups.google.com/forum/#!topic/common-crawl/hsb90GHq6to

Sebastian Nagel
05/07/2017
Hi,

unfortunately, we do not provide (yet) text extracts of the CC-NEWS archives.

But it's easy to run the WET extractor on the WARC files, see:
https://groups.google.com/d/topic/common-crawl/imv4hlLob4s/discussion
https://groups.google.com/d/topic/common-crawl/b6yDG7EmnhM/discussion

That's what you have to do:

# download the WARC files and place them in a directory "warc/"
# create sibling folders wat and wet
# |
# |-- warc/
# |   |-- CC-NEWS-20161001224340-00008.warc.gz
# |   |-- CC-NEWS-20161017145313-00000.warc.gz
# |   `-- ...
# |
# |-- wat/
# |
# `-- wet/

git clone https://github.com/commoncrawl/ia-web-commons
cd ia-web-commons
mvn install

cd ..
git clone https://github.com/commoncrawl/ia-hadoop-tools
cd ia-hadoop-tools
mvn package

java -jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator \
  -strictMode -skipExisting batch-id-xyz .../warc/*.warc.gz

The folders wat/ and wet/ will then contain the exports.

Best,
Sebastian

---

1. So following the above instructions, I first made a warc subfolder in hdfs:///user/vagrant/cc-mri-subset/
Then moved all the downloaded *.warc.gz files into there.
Then created wat and wet subfolders in there alongside the warc folder.
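
Step 1 can be sketched as a small function. This is only a sketch: the name setup_cc_dirs and the local directory in the usage comment are made up, and it assumes the downloaded files sit on the local filesystem and the hdfs client is on the PATH.

```shell
#!/usr/bin/env bash
# Hypothetical helper: create warc/, wat/ and wet/ under an HDFS base directory
# and upload the local *.warc.gz files into warc/.
setup_cc_dirs() {
  local base="$1" local_dir="$2"
  # -p creates missing parent directories and does not fail if they exist.
  hdfs dfs -mkdir -p "$base/warc" "$base/wat" "$base/wet"
  # The glob is expanded by the local shell, so it must match local files.
  hdfs dfs -put "$local_dir"/*.warc.gz "$base/warc/"
}

# Usage (the local directory is an assumption):
#   setup_cc_dirs hdfs:///user/vagrant/cc-mri-subset "$HOME/warc-downloads"
```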

2. Next, I did the 2 git clone and mvn build operations above (mvn install, then mvn package).
The first, ia-web-commons, successfully compiled (despite some test failures).

3. BUT THE 2ND GIT PROJECT, ia-hadoop-tools, DIDN'T COMPILE AT FIRST, with the mvn package command failing:

git clone https://github.com/commoncrawl/ia-hadoop-tools
cd ia-hadoop-tools
mvn package

Compile failed with a message about the JSONTokener constructor not taking a String object. It turned out that the JSONTokener used was a version of the class that was too old, whereas the necessary constructor is present in the most recent version, as seen in the API at https://docs.oracle.com/cd/E51273_03/ToolsAndFrameworks.110/apidoc/assembler/com/endeca/serialization/json/JSONTokener.html

So instead, I opened up ia-hadoop-tools/pom.xml for editing and added the newest version of the org.json package's json (see http://builds.archive.org/maven2/org/json/json/ for <version>) into the pom.xml's <dependencies> element, based on how this was done at https://bukkit.org/threads/problem-loading-libraries-with-maven-java-lang-noclassdeffounderror-org-json-jsonobject.428354/:

<dependency>
  <groupId>org.json</groupId>
  <artifactId>json</artifactId>
  <version>20131018</version>
</dependency>

Then I was able to run "mvn package" successfully.
(Maybe I could also have added in a far more recent version, as seen in the version numbers at https://mvnrepository.com/artifact/org.json/json,
but didn't want to go too far ahead in case there was other incompatibility.)
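
A quick way to confirm that the pom.xml edit is in place before rebuilding is sketched below. The function name is made up, and it assumes the <artifactId> line follows the <groupId> line within two lines, as in the snippet above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the given pom.xml declares the org.json:json artifact.
pom_declares_org_json() {
  local pom="$1"
  # Find the groupId line, then check that the artifactId follows within 2 lines.
  grep -A2 '<groupId>org.json</groupId>' "$pom" | grep -q '<artifactId>json</artifactId>'
}

# Usage:
#   pom_declares_org_json ia-hadoop-tools/pom.xml && echo "org.json dependency declared"
```

To see which version Maven actually resolves, "mvn dependency:tree -Dincludes=org.json" run inside ia-hadoop-tools should list it.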

4. Next, I wanted to finally run the built executable to convert the warc files to wet files.

I had the warc files on the hadoop filesystem. The original instructions at https://groups.google.com/forum/#!topic/common-crawl/hsb90GHq6to however were apparently for working with warcs stored on the local filesystem, as those instructions did not run the hadoop command but the regular java command. The regular java command did not work with the files being on the hadoop system (attempt #1 below).

ATTEMPTS THAT DIDN'T WORK:
1. vagrant@node1:~/ia-hadoop-tools$ java -jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -strictMode -skipExisting batch-id-xyz hdfs:///user/vagrant/cc-mri-subset/warc/*.warc.gz
2. vagrant@node1:~/ia-hadoop-tools$ $HADOOP_MAPRED_HOME/bin/hadoop jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -strictMode -skipExisting batch-id-xyz hdfs:///user/vagrant/cc-mri-subset/warc/*.warc.gz


The 2nd attempt, which uses a proper hadoop command, I based off https://www.tutorialspoint.com/map_reduce/implementation_in_hadoop.htm
It produced lots of errors and the output wet (and wat) .gz files were all corrupt as gunzip could not successfully run over them:

vagrant@node1:~/ia-hadoop-tools$ $HADOOP_MAPRED_HOME/bin/hadoop jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -strictMode -skipExisting batch-id-xyz hdfs:///user/vagrant/cc-mri-subset/warc/*.warc.gz
19/09/05 05:57:22 INFO Configuration.deprecation: mapred.task.timeout is deprecated. Instead, use mapreduce.task.timeout
19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902100139-000000.warc.gz
19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902100141-000001.warc.gz
19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902100451-000002.warc.gz
19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902100453-000003.warc.gz
19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902100805-000004.warc.gz
19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902100809-000005.warc.gz
19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902101119-000006.warc.gz
19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902101119-000007.warc.gz
19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902101429-000008.warc.gz
19/09/05 05:57:23 INFO jobs.WEATGenerator: Add input path: hdfs://node1/user/vagrant/cc-mri-subset/warc/MAORI-CC-2019-30-20190902101429-000009.warc.gz
19/09/05 05:57:23 INFO client.RMProxy: Connecting to ResourceManager at node1/127.0.0.1:8032
19/09/05 05:57:23 INFO client.RMProxy: Connecting to ResourceManager at node1/127.0.0.1:8032
19/09/05 05:57:23 INFO mapred.FileInputFormat: Total input paths to process : 10
19/09/05 05:57:24 INFO mapreduce.JobSubmitter: number of splits:10
19/09/05 05:57:24 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1567397114047_0001
19/09/05 05:57:24 INFO impl.YarnClientImpl: Submitted application application_1567397114047_0001
19/09/05 05:57:24 INFO mapreduce.Job: The url to track the job: http://node1:8088/proxy/application_1567397114047_0001/
19/09/05 05:57:24 INFO mapreduce.Job: Running job: job_1567397114047_0001
19/09/05 05:57:31 INFO mapreduce.Job: Job job_1567397114047_0001 running in uber mode : false
19/09/05 05:57:31 INFO mapreduce.Job: map 0% reduce 0%
19/09/05 05:57:44 INFO mapreduce.Job: map 10% reduce 0%
19/09/05 05:57:44 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000002_0, Status : FAILED
Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

19/09/05 05:57:45 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000004_0, Status : FAILED
Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

19/09/05 05:57:45 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000001_0, Status : FAILED
Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
19/09/05 05:57:45 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000005_0, Status : FAILED
Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
19/09/05 05:57:45 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000000_0, Status : FAILED
Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
19/09/05 05:57:45 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000003_0, Status : FAILED
Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
19/09/05 05:57:46 INFO mapreduce.Job: map 0% reduce 0%
19/09/05 05:57:54 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000007_0, Status : FAILED
Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
19/09/05 05:57:55 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000006_0, Status : FAILED
Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
19/09/05 05:57:57 INFO mapreduce.Job: map 10% reduce 0%
19/09/05 05:57:57 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000009_0, Status : FAILED
Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

19/09/05 05:57:58 INFO mapreduce.Job: map 20% reduce 0%
19/09/05 05:57:58 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000008_0, Status : FAILED
Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
19/09/05 05:58:06 INFO mapreduce.Job: map 30% reduce 0%
19/09/05 05:58:08 INFO mapreduce.Job: map 60% reduce 0%
19/09/05 05:58:09 INFO mapreduce.Job: map 70% reduce 0%
19/09/05 05:58:10 INFO mapreduce.Job: map 80% reduce 0%
19/09/05 05:58:12 INFO mapreduce.Job: map 90% reduce 0%
19/09/05 05:58:13 INFO mapreduce.Job: map 100% reduce 0%
19/09/05 05:58:13 INFO mapreduce.Job: Job job_1567397114047_0001 completed successfully
19/09/05 05:58:13 INFO mapreduce.Job: Counters: 32
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=1239360
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=1430
        HDFS: Number of bytes written=0
        HDFS: Number of read operations=30
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=0
    Job Counters
        Failed map tasks=10
        Launched map tasks=20
        Other local map tasks=10
        Data-local map tasks=10
        Total time spent by all maps in occupied slots (ms)=208160
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=208160
        Total vcore-milliseconds taken by all map tasks=208160
        Total megabyte-milliseconds taken by all map tasks=213155840
    Map-Reduce Framework
        Map input records=10
        Map output records=0
        Input split bytes=1430
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=1461
        CPU time spent (ms)=2490
        Physical memory (bytes) snapshot=1564528640
        Virtual memory (bytes) snapshot=19642507264
        Total committed heap usage (bytes)=1126170624
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=0
vagrant@node1:~/ia-hadoop-tools$
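
Note that the job above reports "completed successfully" even though the counters show Failed map tasks=10, so the final status line alone is not a reliable check. A minimal sketch for pulling a single counter out of saved job output follows; the function name job_counter and the file name job.log are made up, and it assumes the "name=value" counter layout printed above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value of a named counter (e.g. "Failed map tasks")
# from hadoop job output read on stdin; prints nothing if the counter is absent.
job_counter() {
  local name="$1"
  # Match an optionally indented "name=digits" line and print only the digits.
  sed -n "s/^[[:space:]]*${name}=\([0-9][0-9]*\)[[:space:]]*\$/\1/p"
}

# Usage, assuming the job output was saved with "... 2>&1 | tee job.log":
#   job_counter "Failed map tasks" < job.log
```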


5. The error messages are all the same but not very informative:
19/09/05 05:57:45 INFO mapreduce.Job: Task Id : attempt_1567397114047_0001_m_000001_0, Status : FAILED
Error: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;

All the references I could find on google indicated that the full version of the error message was that this method (com.google.common.io.ByteStreams.limit(...)) could not be located (in Java this is typically reported as a NoSuchMethodError).
The page at http://mail-archives.apache.org/mod_mbox/spark-user/201510.mbox/%[email protected]%3E
revealed that guava.jar contains the com.google.common.io.ByteStreams class.
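
To see which jars on the hadoop classpath actually provide a given class, something like the following sketch could help. The function name jars_with_class is made up, and it assumes bash and the unzip tool are available. One thing worth checking with it (an assumption about this setup, not something verified in these notes): Hadoop 2.x distributions commonly bundle an older, versioned Guava jar such as guava-11.0.2.jar under share/hadoop/common/lib, and ByteStreams.limit only exists in newer Guava releases, so a search for the exact file name guava.jar can miss the jar that matters.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list every jar on a colon-separated classpath that
# contains the given class file (e.g. com/google/common/io/ByteStreams.class).
jars_with_class() {
  local classpath="$1" class_file="$2"
  local entries entry jar
  # read does not glob-expand, so entries such as /path/lib/* stay literal here.
  IFS=':' read -r -a entries <<< "$classpath"
  for entry in "${entries[@]}"; do
    case "$entry" in
      */\*)
        # An entry ending in /* stands for "all jars in this directory".
        for jar in "${entry%/\*}"/*.jar; do
          [ -f "$jar" ] && unzip -l "$jar" 2>/dev/null | grep -qF "$class_file" && echo "$jar"
        done
        ;;
      *.jar)
        [ -f "$entry" ] && unzip -l "$entry" 2>/dev/null | grep -qF "$class_file" && echo "$entry"
        ;;
    esac
  done
  return 0
}

# Usage:
#   jars_with_class "$(hadoop classpath)" com/google/common/io/ByteStreams.class
```

Listing the class only shows that it is present; whether a particular method such as limit(...) exists in that jar's version can be checked with "javap -cp <jar> com.google.common.io.ByteStreams".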


TO GET THE EXECUTABLE TO WORK:
I located guava.jar, found there were 2 identical ones on the filesystem but that neither was on the hadoop classpath yet, so I copied it into one of the Hadoop Classpath locations. Then I was able to successfully run the executable and produce meaningful WET files at last from the WARC input files:


vagrant@node1:~$ locate guava.jar
/usr/share/java/guava.jar
/usr/share/maven/lib/guava.jar
vagrant@node1:~$ jar -tvf /usr/share/maven/lib/guava.jar | less
vagrant@node1:~$ jar -tvf /usr/share/java/guava.jar | less
# both contained the ByteStreams class

vagrant@node1:~$ cd -
/home/vagrant/ia-hadoop-tools
vagrant@node1:~/ia-hadoop-tools$ find . -name "guava.jar"
# None in the git project

vagrant@node1:~/ia-hadoop-tools$ hadoop classpath
/usr/local/hadoop/etc/hadoop:/usr/local/hadoop/share/hadoop/common/lib/*:/usr/local/hadoop/share/hadoop/common/*:/usr/local/hadoop/share/hadoop/hdfs:/usr/local/hadoop/share/hadoop/hdfs/lib/*:/usr/local/hadoop/share/hadoop/hdfs/*:/usr/local/hadoop/share/hadoop/yarn/lib/*:/usr/local/hadoop/share/hadoop/yarn/*:/usr/local/hadoop/share/hadoop/mapreduce/lib/*:/usr/local/hadoop/share/hadoop/mapreduce/*:/contrib/capacity-scheduler/*.jar
# guava.jar not on hadoop classpath yet

vagrant@node1:~/ia-hadoop-tools$ diff /usr/share/java/guava.jar /usr/share/maven/lib/guava.jar
# no differences, identical

vagrant@node1:~/ia-hadoop-tools$ hdfs dfs -put /usr/share/java/guava.jar /usr/local/hadoop/share/hadoop/common/.
put: `/usr/local/hadoop/share/hadoop/common/.': No such file or directory
# hadoop classpath locations are not on the hdfs filesystem, but on the regular fs

vagrant@node1:~/ia-hadoop-tools$ sudo cp /usr/share/java/guava.jar /usr/local/hadoop/share/hadoop/common/.
vagrant@node1:~/ia-hadoop-tools$ hadoop classpath
/usr/local/hadoop/etc/hadoop:/usr/local/hadoop/share/hadoop/common/lib/*:/usr/local/hadoop/share/hadoop/common/*:/usr/local/hadoop/share/hadoop/hdfs:/usr/local/hadoop/share/hadoop/hdfs/lib/*:/usr/local/hadoop/share/hadoop/hdfs/*:/usr/local/hadoop/share/hadoop/yarn/lib/*:/usr/local/hadoop/share/hadoop/yarn/*:/usr/local/hadoop/share/hadoop/mapreduce/lib/*:/usr/local/hadoop/share/hadoop/mapreduce/*:/contrib/capacity-scheduler/*.jar
# Copied guava.jar to somewhere on existing hadoop classpath

vagrant@node1:~/ia-hadoop-tools$ $HADOOP_MAPRED_HOME/bin/hadoop jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -strictMode -skipExisting batch-id-xyz hdfs:///user/vagrant/cc-mri-subset/warc/*.warc.gz
# Successful run

vagrant@node1:~$ hdfs dfs -get /user/vagrant/cc-mri-subset/wet/MAORI-CC-2019-30-20190902100139-000000.warc.wet.gz /home/vagrant/.
vagrant@node1:~$ cd ..
vagrant@node1:~$ gunzip MAORI-CC-2019-30-20190902100139-000000.warc.wet.gz
vagrant@node1:~$ less MAORI-CC-2019-30-20190902100139-000000.warc.wet
# Copied a WET output file from the hadoop filesystem to local filesystem and inspected its contents. Works!
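
Rather than gunzipping one output file by hand, all downloaded .gz files can be checked in one go with gzip -t, which tests integrity without writing the decompressed data. A minimal sketch; the function name check_gz_files is made up.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which .gz files pass or fail gzip's integrity test.
# Returns non-zero if at least one file is corrupt.
check_gz_files() {
  local file bad=0
  for file in "$@"; do
    if gzip -t "$file" 2>/dev/null; then
      echo "OK      $file"
    else
      echo "CORRUPT $file"
      bad=1
    fi
  done
  return "$bad"
}

# Usage, after copying the wet files to the local filesystem with hdfs dfs -get:
#   check_gz_files *.warc.wet.gz
```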

-----------------------------------
VIEW THE MRI-ONLY INDEX GENERATED
-----------------------------------
hdfs dfs -cat hdfs:///user/vagrant/cc-mri-csv/part* | tail -5

(gz archive, binary file)

vagrant@node1:~/cc-index-table/src/script$ hdfs dfs -mkdir hdfs:///user/vagrant/cc-mri-unzipped-csv

# https://stackoverflow.com/questions/34573279/how-to-unzip-gz-files-in-a-new-directory-in-hadoop
XXX vagrant@node1:~/cc-index-table/src/script$ hadoop fs -cat hdfs:///user/vagrant/cc-mri-csv/part* | gzip -d | hadoop fs -put - hdfs:///user/vagrant/cc-mri-unzipped-csv


vagrant@node1:~/cc-index-table/src/script$ hdfs dfs -cat hdfs:///user/vagrant/cc-mri-csv/part* | gzip -d | hdfs dfs -put - hdfs:///user/vagrant/cc-mri-unzipped-csv/cc-mri.csv
vagrant@node1:~/cc-index-table/src/script$ hdfs dfs -ls hdfs:///user/vagrant/cc-mri-unzipped-csv
Found 1 items
-rw-r--r-- 1 vagrant supergroup 71664603 2019-08-29 04:47 hdfs:///user/vagrant/cc-mri-unzipped-csv/cc-mri.csv

# https://stackoverflow.com/questions/14925323/view-contents-of-file-in-hdfs-hadoop
vagrant@node1:~/cc-index-table/src/script$ hdfs dfs -cat hdfs:///user/vagrant/cc-mri-unzipped-csv/cc-mri.csv | tail -5

# url, warc_filename, warc_record_offset, warc_record_length
http://paupauocean.com/page91?product_id=142&brd=1,crawl-data/CC-MAIN-2019-30/segments/1563195526940.0/warc/CC-MAIN-20190721082354-20190721104354-00088.warc.gz,115081770,21404
https://cookinseln-reisen.de/cook-inseln/rarotonga/,crawl-data/CC-MAIN-2019-30/segments/1563195526799.4/warc/CC-MAIN-20190720235054-20190721021054-00289.warc.gz,343512295,12444
http://www.halopharm.com/mi/profile/,crawl-data/CC-MAIN-2019-30/segments/1563195525500.21/warc/CC-MAIN-20190718042531-20190718064531-00093.warc.gz,219160333,10311
https://www.firstpeople.us/pictures/green/Touched-by-the-hand-of-Time-1907.html,crawl-data/CC-MAIN-2019-30/segments/1563195526670.1/warc/CC-MAIN-20190720194009-20190720220009-00362.warc.gz,696195242,5408
https://www.sos-accessoire.com/programmateur-programmateur-module-electronique-whirlpool-481231028062-27573.html,crawl-data/CC-MAIN-2019-30/segments/1563195527048.80/warc/CC-MAIN-20190721144008-20190721170008-00164.warc.gz,830087190,26321

# https://stackoverflow.com/questions/32612867/how-to-count-lines-in-a-file-on-hdfs-command
vagrant@node1:~/cc-index-table/src/script$ hdfs dfs -cat hdfs:///user/vagrant/cc-mri-unzipped-csv/cc-mri.csv | wc -l
345625


ANOTHER WAY (DR BAINBRIDGE'S WAY) TO CREATE SINGLE .CSV FILE FROM /part* FILES AND VIEW ITS CONTENTS:
vagrant@node1:~/cc-index-table$ hdfs dfs -cat hdfs:///user/vagrant/cc-mri-csv/part* > file.csv.gz
vagrant@node1:~/cc-index-table$ less file.csv.gz
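
Both ways above rely on the fact that gzip accepts several gzip members concatenated into one stream, which is why cat-ing all the part* files together and decompressing once works. The first way can be wrapped in a small function; this is a sketch only, and the name hdfs_gunzip_merge is made up.

```shell
#!/usr/bin/env bash
# Hypothetical helper: concatenate gzipped part files from HDFS, decompress the
# combined stream, and write the result back to HDFS as a single file.
hdfs_gunzip_merge() {
  local src_glob="$1" dest="$2"
  # The glob is quoted so that the hdfs client expands it, not the local shell.
  hdfs dfs -cat "$src_glob" | gzip -d | hdfs dfs -put - "$dest"
}

# Usage:
#   hdfs_gunzip_merge 'hdfs:///user/vagrant/cc-mri-csv/part*' \
#                     hdfs:///user/vagrant/cc-mri-unzipped-csv/cc-mri.csv
```

For the second way, less only shows readable text for file.csv.gz where its input preprocessor is configured; if it shows binary data, "zcat file.csv.gz | tail -5" should work as an alternative.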


https://www.patricia-anong.com/blog/2017/11/1/extend-vmdk-on-virtualbox


When not using LIKE '%mri%' but = 'mri' instead:
vagrant@node1:~/cc-index-table$ hdfs dfs -cat hdfs:///user/vagrant/cc-mri-unzipped-csv/cc-mri.csv | wc -l
5767


For a month later, the August 2019 crawl:
vagrant@node1:~$ hdfs dfs -cat hdfs:///user/vagrant/CC-MAIN-2019-35/cc-mri-unzipped-csv/cc-mri.csv | wc -l
9318

-----------------------------------------
Running export_mri_subset.sh
-----------------------------------------

The export_mri_subset.sh script is set up to run on the csv input file produced by running export_mri_index_csv.sh

Running this initially produced the following exception:


2019-08-29 05:48:52 INFO CCIndexExport:152 - Number of records/rows matched by query: 345624
2019-08-29 05:48:52 INFO CCIndexExport:157 - Distributing 345624 records to 70 output partitions (max. 5000 records per WARC file)
2019-08-29 05:48:52 INFO CCIndexExport:165 - Repartitioning data to 70 output partitions
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`url`' given input columns: [http://176.31.110.213:600/?p=287, crawl-data/CC-MAIN-2019-30/segments/1563195527531.84/warc/CC-MAIN-20190722051628-20190722073628-00547.warc.gz, 1215489, 15675];;
'Project ['url, 'warc_filename, 'warc_record_offset, 'warc_record_length]
+- AnalysisBarrier
   +- Repartition 70, true
      +- Relation[http://176.31.110.213:600/?p=287#10,crawl-data/CC-MAIN-2019-30/segments/1563195527531.84/warc/CC-MAIN-20190722051628-20190722073628-00547.warc.gz#11,1215489#12,15675#13] csv

    at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:88)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
    at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95)
    at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:106)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:116)
    at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:120)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.AbstractTraversable.map(Traversable.scala:104)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:120)
    at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:125)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:125)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:95)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:85)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80)
    at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:104)
    at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
    at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
    at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74)
    at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:3295)
    at org.apache.spark.sql.Dataset.select(Dataset.scala:1307)
    at org.apache.spark.sql.Dataset.select(Dataset.scala:1325)
    at org.apache.spark.sql.Dataset.select(Dataset.scala:1325)
    at org.commoncrawl.spark.examples.CCIndexWarcExport.run(CCIndexWarcExport.java:169)
    at org.commoncrawl.spark.examples.CCIndexExport.run(CCIndexExport.java:192)
    at org.commoncrawl.spark.examples.CCIndexWarcExport.main(CCIndexWarcExport.java:214)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
2019-08-29 05:48:52 INFO SparkContext:54 - Invoking stop() from shutdown hook



Hints to solve it were at https://stackoverflow.com/questions/45972929/scala-dataframereader-keep-column-headers
The actual solution is to edit CCIndexWarcExport.java as follows:
1. Set option("header", ...) to false, since the csv file contains no header row, only data rows. (With header set to true, the first data row was being used as the column names, which is why the exception above lists a URL, a warc filename and two numbers as the "input columns".) You can confirm the csv has no header row by doing
hdfs dfs -cat hdfs:///user/vagrant/cc-mri-csv/part* | head -5

2. With no header row, the 4 columns get the default names _c0 to _c3, not url/warc_filename etc., so the select() call has to use those names.

emacs src/main/java/org/commoncrawl/spark/examples/CCIndexWarcExport.java

Change:
sqlDF = sparkSession.read().format("csv").option("header", true).option("inferSchema", true)
        .load(csvQueryResult);
To
sqlDF = sparkSession.read().format("csv").option("header", false).option("inferSchema", true)
        .load(csvQueryResult);

And comment out:
//JavaRDD<Row> rdd = sqlDF.select("url", "warc_filename", "warc_record_offset", "warc_record_length").rdd()
//        .toJavaRDD();
Replace with the default column names:
JavaRDD<Row> rdd = sqlDF.select("_c0", "_c1", "_c2", "_c3").rdd()
        .toJavaRDD();
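
The "does the csv have a header row?" check from step 1 can be scripted rather than eyeballed. A minimal sketch: the function name csv_has_header is made up, and the header text it looks for is simply the 4 column names from the query.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the first line read from stdin is the
# header row "url,warc_filename,warc_record_offset,warc_record_length".
csv_has_header() {
  local first
  IFS= read -r first
  [ "$first" = "url,warc_filename,warc_record_offset,warc_record_length" ]
}

# Usage (on the unzipped csv; add "| gzip -d" after -cat if the part files are gzipped):
#   hdfs dfs -cat hdfs:///user/vagrant/cc-mri-unzipped-csv/cc-mri.csv | head -1 \
#     | csv_has_header && echo "header row present" || echo "no header row"
```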


Now recompile:
mvn package

And run:
./src/script/export_mri_subset.sh

-------------------------

WET example from https://github.com/commoncrawl/cc-warc-examples

vagrant@node1:~/cc-warc-examples$ hdfs dfs -mkdir /user/vagrant/data
vagrant@node1:~/cc-warc-examples$ hdfs dfs -put data/CC-MAIN-20190715175205-20190715200159-00000.warc.wet.gz hdfs:///user/vagrant/data/.
vagrant@node1:~/cc-warc-examples$ hdfs dfs -ls data
Found 1 items
-rw-r--r-- 1 vagrant supergroup 154823265 2019-08-19 08:23 data/CC-MAIN-20190715175205-20190715200159-00000.warc.wet.gz
vagrant@node1:~/cc-warc-examples$ hadoop jar target/cc-warc-examples-0.3-SNAPSHOT-jar-with-dependencies.jar org.commoncrawl.examples.mapreduce.WETWordCount

<ONCE FINISHED:>

vagrant@node1:~/cc-warc-examples$ hdfs dfs -cat /tmp/cc/part*
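
The raw word count output is long, so it helps to look at only the most frequent words. A minimal sketch, assuming each output line has the form word<TAB>count (the usual layout of Hadoop text output; not checked against WETWordCount here). The function name top_words is made up.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "word<TAB>count" lines on stdin and print the
# N lines with the highest counts (default 10).
top_words() {
  local n="${1:-10}"
  # Sort numerically, descending, on the 2nd tab-separated field.
  sort -t "$(printf '\t')" -k2,2nr | head -n "$n"
}

# Usage:
#   hdfs dfs -cat /tmp/cc/part* | top_words 20
```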
429 |
|
---|
430 |
|
---|
431 |
|
---|
432 | INFO ON HADOOP/HDFS:
|
---|
433 | https://www.bluedata.com/blog/2016/08/hadoop-hdfs-upgrades-painful/
|
---|
434 |
|
---|
435 | SPARK:
|
---|
436 | configure option example: https://stackoverflow.com/questions/46152526/how-should-i-configure-spark-to-correctly-prune-hive-metastore-partitions
|
---|
437 |
|
---|
438 |
|
---|
439 |
|
---|
440 | LIKE '%isl%'
|
---|
441 |
|
---|
442 | cd cc-index-table
|
---|
443 | APPJAR=target/cc-spark-0.2-SNAPSHOT-jar-with-dependencies.jar
|
---|
444 | # ($SPARK_ON_YARN is left out here; if wanted, it goes right after spark-submit)
|
---|
445 | $SPARK_HOME/bin/spark-submit \
|
---|
446 | --conf spark.hadoop.parquet.enable.dictionary=true \
|
---|
447 | --conf spark.hadoop.parquet.enable.summary-metadata=false \
|
---|
448 | --conf spark.sql.hive.metastorePartitionPruning=true \
|
---|
449 | --conf spark.sql.parquet.filterPushdown=true \
|
---|
450 | --conf spark.sql.parquet.mergeSchema=true \
|
---|
451 | --class org.commoncrawl.spark.examples.CCIndexWarcExport $APPJAR \
|
---|
452 | --query "SELECT url, warc_filename, warc_record_offset, warc_record_length
|
---|
453 | FROM ccindex
|
---|
454 | WHERE crawl = 'CC-MAIN-2019-30' AND subset = 'warc' AND content_languages LIKE '%mri%'" \
|
---|
455 | --numOutputPartitions 12 \
|
---|
456 | --numRecordsPerWarcFile 20000 \
|
---|
457 | --warcPrefix ICELANDIC-CC-2018-43 \
|
---|
458 | s3://commoncrawl/cc-index/table/cc-main/warc/ \
|
---|
459 | .../my_output_path/
|
---|
460 |
|
---|
461 |
|
---|
462 | ----
|
---|
463 | TIME
|
---|
464 | ----
|
---|
465 | 1. https://dzone.com/articles/need-billions-of-web-pages-dont-bother-crawling
|
---|
466 | http://digitalpebble.blogspot.com/2017/03/need-billions-of-web-pages-dont-bother_29.html
|
---|
467 |
|
---|
468 | "So, not only have CommonCrawl given you loads of web data for free, they've also made your life easier by preprocessing the data for you. For many tasks, the content of the WAT or WET files will be sufficient and you won't have to process the WARC files.
|
---|
469 |
|
---|
470 | This should not only help you simplify your code but also make the whole processing faster. We recently ran an experiment on CommonCrawl where we needed to extract anchor text from HTML pages. We initially wrote some MapReduce code to extract the binary content of the pages from their WARC representation, processed the HTML with JSoup and reduced on the anchor text. Processing a single WARC segment took roughly 100 minutes on a 10-node EMR cluster. We then simplified the extraction logic, took the WAT files as input and the processing time dropped to 17 minutes on the same cluster. This gain was partly due to not having to parse the web pages, but also to the fact that WAT files are a lot smaller than their WARC counterparts."
|
---|
471 |
|
---|
472 | 2. https://spark-in.me/post/parsing-common-crawl-in-two-simple-commands
|
---|
473 | "Both the parsing part and the processing part take just a couple of minutes per index file / WET file - the bulk of the “compute” lies within actually downloading these files.
|
---|
474 |
|
---|
475 | Essentially if you have some time to spare and an unlimited Internet connection, all of this processing can be done on one powerful machine. You can be fancy and go ahead and rent some Amazon server(s) to minimize the download time, but that can be costly.
|
---|
476 |
|
---|
477 | In my experience - parsing the whole index for Russian websites (just filtering by language) takes approximately 140 hours - but the majority of this time is just downloading (my speed averaged ~300-500 kb/s)."
|
---|
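To get a feel for why download time dominates, a rough estimate can be computed from a file size and a sustained transfer rate. The sketch below uses integer arithmetic and rounds up. The 400000 bytes/s figure in the example is only an illustrative value, not a measurement, and `est_download_seconds` is a hypothetical helper:

```shell
# Hypothetical helper: seconds needed to transfer SIZE bytes at RATE
# bytes per second, rounded up to a whole second.
est_download_seconds() {
    size=$1
    rate=$2
    echo $(( (size + rate - 1) / rate ))
}

# Example: the 154823265-byte WET file listed earlier, at an assumed
# 400000 bytes/s:
#   est_download_seconds 154823265 400000
```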
478 |
|
---|
479 | ----
|
---|
480 | CMDS
|
---|
481 | ----
|
---|
482 | https://stackoverflow.com/questions/29565716/spark-kill-running-application
|
---|
483 |
|
---|
484 | =========================================================
|
---|
485 | Configuring spark to work on Amazon AWS s3a dataset:
|
---|
486 | =========================================================
|
---|
487 | https://stackoverflow.com/questions/30385981/how-to-access-s3a-files-from-apache-spark
|
---|
488 | http://deploymentzone.com/2015/12/20/s3a-on-spark-on-aws-ec2/
|
---|
489 | https://answers.dataiku.com/1734/common-crawl-s3
|
---|
490 | https://stackoverflow.com/questions/2354525/what-should-be-hadoop-tmp-dir
|
---|
491 | https://stackoverflow.com/questions/40169610/where-exactly-should-hadoop-tmp-dir-be-set-core-site-xml-or-hdfs-site-xml
|
---|
492 |
|
---|
493 | https://stackoverflow.com/questions/43759896/spark-truncated-the-string-representation-of-a-plan-since-it-was-too-large-w
|
---|
494 |
|
---|
495 |
|
---|
496 | https://sparkour.urizone.net/recipes/using-s3/
|
---|
497 | Configuring Spark to Use Amazon S3
|
---|
498 | "Some Spark tutorials show AWS access keys hardcoded into the file paths. This is a horribly insecure approach and should never be done. Use exported environment variables or IAM Roles instead, as described in Configuring Amazon S3 as a Spark Data Source."
|
---|
499 |
|
---|
500 | "No FileSystem for scheme: s3n
|
---|
501 |
|
---|
502 | java.io.IOException: No FileSystem for scheme: s3n
|
---|
503 |
|
---|
504 | This message appears when dependencies are missing from your Apache Spark distribution. If you see this error message, you can use the --packages parameter and Spark will use Maven to locate the missing dependencies and distribute them to the cluster. Alternately, you can use --jars if you manually downloaded the dependencies already. These parameters also works on the spark-submit script."
|
---|
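As a sketch of the `--packages` route described in that quote: the Maven coordinate below matches the hadoop-aws 2.7.6 jar used elsewhere in these notes, but whether it resolves cleanly against a given Spark build is an assumption to verify. The helper (a hypothetical name) only prints the command so it can be reviewed before running:

```shell
# Hypothetical helper: print a spark-submit command line that asks Spark
# to fetch hadoop-aws (and its dependencies) from Maven via --packages.
# Arguments: application jar, main class.
print_submit_with_packages() {
    appjar=$1
    mainclass=$2
    printf '%s\n' "\$SPARK_HOME/bin/spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.6 --class $mainclass $appjar"
}

# Example use:
#   print_submit_with_packages target/cc-spark-0.2-SNAPSHOT-jar-with-dependencies.jar org.commoncrawl.spark.examples.CCIndexExport
```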
505 |
|
---|
506 | ===========================================
|
---|
507 | IAM Role (or user) and commoncrawl profile
|
---|
508 | ===========================================
|
---|
509 |
|
---|
510 | "iam" role or user for commoncrawl(er) profile
|
---|
511 |
|
---|
512 |
|
---|
513 | aws management console:
|
---|
514 | [email protected]
|
---|
515 | lab pwd, capital R and ! (maybe g)
|
---|
516 |
|
---|
517 | commoncrawl profile created while creating the user/role, by following the instructions at: https://answers.dataiku.com/1734/common-crawl-s3
|
---|
518 |
|
---|
519 | https://answers.dataiku.com/1734/common-crawl-s3
|
---|
520 | Even though the bucket is public, if your AWS key does not have your full permissions (ie if it's a restricted IAM user), you need to grant explicit access to the commoncrawl bucket: attach the following policy to your IAM user:
|
---|
521 | #### START JSON (POLICY) ###
|
---|
522 | {
|
---|
523 | "Version": "2012-10-17",
|
---|
524 | "Statement": [
|
---|
525 | {
|
---|
526 | "Sid": "Stmt1503647467000",
|
---|
527 | "Effect": "Allow",
|
---|
528 | "Action": [
|
---|
529 | "s3:GetObject",
|
---|
530 | "s3:ListBucket"
|
---|
531 | ],
|
---|
532 | "Resource": [
|
---|
533 | "arn:aws:s3:::commoncrawl/*",
|
---|
534 | "arn:aws:s3:::commoncrawl"
|
---|
535 | ]
|
---|
536 | }
|
---|
537 | ]
|
---|
538 | }
|
---|
539 | #### END ###
|
---|
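Once the policy is attached, access can be checked with the AWS CLI before involving Spark. A sketch, assuming the AWS CLI is installed and that the profile is named `commoncrawl` as in these notes. `check_cc_access` is a hypothetical wrapper:

```shell
# Hypothetical wrapper: list the Common Crawl index table prefix using the
# given AWS CLI profile (default "commoncrawl"). A listing suggests that
# s3:ListBucket is allowed; an access-denied error suggests the policy is
# not attached to the user behind that profile.
check_cc_access() {
    profile=${1:-commoncrawl}
    aws s3 ls "s3://commoncrawl/cc-index/table/cc-main/warc/" --profile "$profile"
}

# Example use:
#   check_cc_access
```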
540 |
|
---|
541 | <!--
|
---|
542 | <property>
|
---|
543 | <name>fs.s3a.awsAccessKeyId</name>
|
---|
544 | <value>XXX</value>
|
---|
545 | </property>
|
---|
546 | <property>
|
---|
547 | <name>fs.s3a.awsSecretAccessKey</name>
|
---|
548 | <value>XXX</value>
|
---|
549 | </property>
|
---|
550 | -->
|
---|
551 |
|
---|
552 |
|
---|
553 | [If the access key and secret key were specified in hadoop's core-site.xml and not in the spark properties file, then running export_maori_index_csv.sh produced the following error:
|
---|
554 |
|
---|
555 | 2019-08-29 06:16:38 INFO StateStoreCoordinatorRef:54 - Registered StateStoreCoordinator endpoint
|
---|
556 | 2019-08-29 06:16:40 WARN FileStreamSink:66 - Error while looking for metadata directory.
|
---|
557 | Exception in thread "main" com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain
|
---|
558 | at com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
|
---|
559 | at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3521)
|
---|
560 | at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
|
---|
561 | at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
|
---|
562 | at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
|
---|
563 | ]
|
---|
564 |
|
---|
565 | Instead of putting the access and secret keys in hadoop's core-site.xml as above (with sudo emacs /usr/local/hadoop-2.7.6/etc/hadoop/core-site.xml)
|
---|
566 |
|
---|
567 | you'll want to put the Amazon AWS access key and secret key in the spark properties file:
|
---|
568 |
|
---|
569 | sudo emacs /usr/local/spark-2.3.0-bin-hadoop2.7/conf/spark-defaults.conf
|
---|
570 |
|
---|
571 |
|
---|
572 | The spark properties conf file above should contain:
|
---|
573 |
|
---|
574 | spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
|
---|
575 | spark.hadoop.fs.s3a.access.key=PASTE_IAM-ROLE_ACCESSKEY
|
---|
576 | spark.hadoop.fs.s3a.secret.key=PASTE_IAM-ROLE_SECRETKEY
|
---|
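An alternative sketch that keeps the keys out of the properties file, in line with the advice quoted earlier about exported environment variables: build the `--conf` arguments from `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` at submit time. The helper name is hypothetical; the property names are the ones listed above. Note that values passed on a command line may be visible in the process list, so this is a trade-off rather than a clear improvement:

```shell
# Hypothetical helper: print the s3a --conf arguments for spark-submit,
# one per line, taking the keys from the environment. Returns 1 if either
# variable is unset or empty.
s3a_conf_args() {
    if [ -z "${AWS_ACCESS_KEY_ID:-}" ] || [ -z "${AWS_SECRET_ACCESS_KEY:-}" ]; then
        echo "AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY must be exported" >&2
        return 1
    fi
    printf '%s\n' \
        "--conf" "spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem" \
        "--conf" "spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID" \
        "--conf" "spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY"
}

# Example use (relies on word splitting; the keys contain no whitespace):
#   $SPARK_HOME/bin/spark-submit $(s3a_conf_args) --class ... $APPJAR ...
```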
577 |
|
---|
578 |
|
---|
579 |
|
---|
580 | While the job is running, the Spark Context web UI can be visited at http://node1:4040/jobs/ (it was http://node1:4041/jobs/ for me the first time, since I forwarded the vagrant VM's ports at +1. However, subsequent times it was on node1:4040/jobs?)
|
---|
581 |
|
---|
582 | -------------
|
---|
583 |
|
---|
584 | APPJAR=target/cc-spark-0.2-SNAPSHOT-jar-with-dependencies.jar
|
---|
585 | $SPARK_HOME/bin/spark-submit \
|
---|
586 | --conf spark.hadoop.parquet.enable.dictionary=true \
|
---|
587 | --conf spark.hadoop.parquet.enable.summary-metadata=false \
|
---|
588 | --conf spark.sql.hive.metastorePartitionPruning=true \
|
---|
589 | --conf spark.sql.parquet.filterPushdown=true \
|
---|
590 | --conf spark.sql.parquet.mergeSchema=true \
|
---|
591 | --class org.commoncrawl.spark.examples.CCIndexExport $APPJAR \
|
---|
592 | --query "SELECT url, warc_filename, warc_record_offset, warc_record_length
|
---|
593 | FROM ccindex
|
---|
594 | WHERE crawl = 'CC-MAIN-2019-30' AND subset = 'warc' AND content_languages LIKE '%mri%'" \
|
---|
595 | --outputFormat csv \
|
---|
596 | --numOutputPartitions 10 \
|
---|
597 | --outputCompression gzip \
|
---|
598 | s3://commoncrawl/cc-index/table/cc-main/warc/ \
|
---|
599 | hdfs:///user/vagrant/cc-mri-csv
|
---|
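After the export finishes, a quick sanity check is to count the rows in the gzip-compressed CSV output. A sketch, assuming the part files end in `.csv.gz` under the output directory given above (the exact file names Spark writes are not shown in these notes). `count_gz_lines` is a hypothetical helper:

```shell
# Hypothetical helper: count lines in gzip-compressed data read from stdin.
# Several gzip files concatenated one after another are decompressed by
# gzip -dc as a single stream.
count_gz_lines() {
    gzip -dc | wc -l | tr -d ' '
}

# Example use:
#   hdfs dfs -cat 'hdfs:///user/vagrant/cc-mri-csv/*.csv.gz' | count_gz_lines
```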
600 |
|
---|
601 | ----------------
|
---|
602 | Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
|
---|
603 |
|
---|
604 |
|
---|
605 | https://stackoverflow.com/questions/39355354/spark-no-filesystem-for-scheme-https-cannot-load-files-from-amazon-s3
|
---|
606 | https://stackoverflow.com/questions/33356041/technically-what-is-the-difference-between-s3n-s3a-and-s3
|
---|
607 | "2018-01-10 Update Hadoop 3.0 has cut its s3: and s3n implementations: s3a is all you get. It is now significantly better than its predecessor and performs as least as good as the Amazon implementation. Amazon's "s3:" is still offered by EMR, which is their closed source client. Consult the EMR docs for more info."
|
---|
608 |
|
---|
609 | 1. https://stackoverflow.com/questions/30385981/how-to-access-s3a-files-from-apache-spark
|
---|
610 |
|
---|
611 | "Having experienced first hand the difference between s3a and s3n - 7.9GB of data transferred on s3a was around ~7 minutes while 7.9GB of data on s3n took 73 minutes [us-east-1 to us-west-1 unfortunately in both cases; Redshift and Lambda being us-east-1 at this time] this is a very important piece of the stack to get correct and it's worth the frustration.
|
---|
612 |
|
---|
613 | Here are the key parts, as of December 2015:
|
---|
614 |
|
---|
615 | Your Spark cluster will need a Hadoop version 2.x or greater. If you use the Spark EC2 setup scripts and maybe missed it, the switch for using something other than 1.0 is to specify --hadoop-major-version 2 (which uses CDH 4.2 as of this writing).
|
---|
616 |
|
---|
617 | You'll need to include what may at first seem to be an out of date AWS SDK library (built in 2014 as version 1.7.4) for versions of Hadoop as late as 2.7.1 (stable): aws-java-sdk 1.7.4. As far as I can tell using this along with the specific AWS SDK JARs for 1.10.8 hasn't broken anything.
|
---|
618 |
|
---|
619 | You'll also need the hadoop-aws 2.7.1 JAR on the classpath. This JAR contains the class org.apache.hadoop.fs.s3a.S3AFileSystem.
|
---|
620 |
|
---|
621 | In spark.properties you probably want some settings that look like this:
|
---|
622 |
|
---|
623 | spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
|
---|
624 | spark.hadoop.fs.s3a.access.key=ACCESSKEY
|
---|
625 | spark.hadoop.fs.s3a.secret.key=SECRETKEY
|
---|
626 |
|
---|
627 | I've detailed this list in more detail on a post I wrote as I worked my way through this process. In addition I've covered all the exception cases I hit along the way and what I believe to be the cause of each and how to fix them."
|
---|
628 |
|
---|
629 |
|
---|
630 | 2. The classpath used by hadoop can be found by running the command (https://stackoverflow.com/questions/26748811/setting-external-jars-to-hadoop-classpath):
|
---|
631 | hadoop classpath
|
---|
632 |
|
---|
633 |
|
---|
634 | 3. Got hadoop-aws 2.7.6 jar
|
---|
635 | from https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/2.7.6
|
---|
636 | and put it into /home/vagrant
|
---|
637 |
|
---|
638 |
|
---|
639 | 4. https://stackoverflow.com/questions/26748811/setting-external-jars-to-hadoop-classpath
|
---|
640 | https://stackoverflow.com/questions/28520821/how-to-add-external-jar-to-hadoop-job/54459211#54459211
|
---|
641 | vagrant@node1:~$ export LIBJARS=/home/vagrant/hadoop-aws-2.7.6.jar
|
---|
642 | vagrant@node1:~$ export HADOOP_CLASSPATH=`echo ${LIBJARS} | sed s/,/:/g`
|
---|
643 | vagrant@node1:~$ hadoop classpath
|
---|
644 |
|
---|
645 | 5. https://community.cloudera.com/t5/Community-Articles/HDP-2-4-0-and-Spark-1-6-0-connecting-to-AWS-S3-buckets/ta-p/245760
|
---|
646 | "Download the aws sdk for java https://aws.amazon.com/sdk-for-java/ Uploaded it to the hadoop directory. You should see the aws-java-sdk-1.10.65.jar in /usr/hdp/2.4.0.0-169/hadoop/"
|
---|
647 |
|
---|
648 | I got version 1.11
|
---|
649 |
|
---|
650 | [Can't find a spark.properties file, but this seems to contain spark specific properties:
|
---|
651 | $SPARK_HOME/conf/spark-defaults.conf
|
---|
652 |
|
---|
653 | https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-properties.html
|
---|
654 | "The default Spark properties file is $SPARK_HOME/conf/spark-defaults.conf that could be overriden using spark-submit with the --properties-file command-line option."]
|
---|
655 |
|
---|
656 | The 2 jar files hadoop-aws-2.7.6.jar and aws-java-sdk-1.11.616.jar can be copied (with sudo) to:
|
---|
657 | /usr/local/hadoop/share/hadoop/common/
|
---|
658 | (else /usr/local/hadoop/share/hadoop/hdfs/hadoop-aws-2.7.6.jar)
|
---|
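The `sed s/,/:/g` in step 4 is there because jar lists are commonly written comma-separated while `HADOOP_CLASSPATH` is colon-separated. A sketch of the same conversion for both jars at once. The helper name is hypothetical, and the location of the aws-java-sdk jar under /home/vagrant is an assumption:

```shell
# Hypothetical helper: turn a comma-separated jar list into a
# colon-separated classpath string.
libjars_to_classpath() {
    printf '%s\n' "$1" | sed 's/,/:/g'
}

# Example use (second path is assumed; adjust to where the jar was saved):
#   LIBJARS=/home/vagrant/hadoop-aws-2.7.6.jar,/home/vagrant/aws-java-sdk-1.11.616.jar
#   export HADOOP_CLASSPATH=$(libjars_to_classpath "$LIBJARS")
#   hadoop classpath | tr ':' '\n' | grep aws
```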
659 |
|
---|
660 | --------
|
---|
661 | schema
|
---|
662 | https://commoncrawl.s3.amazonaws.com/cc-index/table/cc-main/index.html
|
---|
663 |
|
---|
664 | ---------------
|
---|
665 | More examples to try:
|
---|
666 | https://github.com/commoncrawl/cc-warc-examples
|
---|
667 |
|
---|
668 |
|
---|
669 | A bit outdated?
|
---|
670 | https://www.journaldev.com/20342/apache-spark-example-word-count-program-java
|
---|
671 | https://www.journaldev.com/20261/apache-spark
|
---|
672 |
|
---|
673 | --------
|
---|
674 |
|
---|
675 | sudo apt-get install maven
|
---|
676 | (or sudo apt update
|
---|
677 | sudo apt install maven)
|
---|
678 | git clone https://github.com/commoncrawl/cc-index-table.git
|
---|
679 | cd cc-index-table
|
---|
680 | mvn package
|
---|
681 | vagrant@node1:~/cc-index-table$ ./src/script/convert_url_index.sh https://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2019-30/indexes/cdx-00000.gz hdfs:///user/vagrant/cc-index-table
|
---|
682 |
|
---|
683 |
|
---|
684 |
|
---|
685 |
|
---|
686 | spark:
|
---|
687 | https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-shell.html
|
---|
688 |
|
---|
689 | ============
|
---|
690 | Dr Bainbridge found the following vagrant file that will set up hadoop and spark presumably for cluster computing:
|
---|
691 |
|
---|
692 | https://github.com/martinprobson/vagrant-hadoop-hive-spark
|
---|
693 |
|
---|
694 | Vagrant:
|
---|
695 | * Guide: https://www.vagrantup.com/intro/getting-started/index.html
|
---|
696 | * Common cmds: https://blog.ipswitch.com/5-vagrant-commands-you-need-to-know
|
---|
697 | * vagrant reload = vagrant halt + vagrant up https://www.vagrantup.com/docs/cli/reload.html
|
---|
698 | * https://stackoverflow.com/questions/46903623/how-to-use-firefox-ui-in-vagrant-box
|
---|
699 | * https://stackoverflow.com/questions/22651399/how-to-install-firefox-in-precise64-vagrant-box
|
---|
700 | sudo apt-get -y install firefox
|
---|
701 | * vagrant install emacs: https://medium.com/@AnnaJS15/getting-started-with-virtualbox-and-vagrant-8d98aa271d2a
|
---|
702 |
|
---|
703 | * hadoop conf: sudo vi /usr/local/hadoop-2.7.6/etc/hadoop/mapred-site.xml
|
---|
704 | * https://data-flair.training/forums/topic/mkdir-cannot-create-directory-data-name-node-is-in-safe-mode/
|
---|
705 | ---
|
---|
706 | ==> node1: Forwarding ports...
|
---|
707 | node1: 8080 (guest) => 8081 (host) (adapter 1)
|
---|
708 | node1: 8088 (guest) => 8089 (host) (adapter 1)
|
---|
709 | node1: 9083 (guest) => 9084 (host) (adapter 1)
|
---|
710 | node1: 4040 (guest) => 4041 (host) (adapter 1)
|
---|
711 | node1: 18888 (guest) => 18889 (host) (adapter 1)
|
---|
712 | node1: 16010 (guest) => 16011 (host) (adapter 1)
|
---|
713 | node1: 22 (guest) => 2200 (host) (adapter 1)
|
---|
714 | ==> node1: Running 'pre-boot' VM customizations...
|
---|
715 |
|
---|
716 |
|
---|
717 | ==> node1: Checking for guest additions in VM...
|
---|
718 | node1: The guest additions on this VM do not match the installed version of
|
---|
719 | node1: VirtualBox! In most cases this is fine, but in rare cases it can
|
---|
720 | node1: prevent things such as shared folders from working properly. If you see
|
---|
721 | node1: shared folder errors, please make sure the guest additions within the
|
---|
722 | node1: virtual machine match the version of VirtualBox you have installed on
|
---|
723 | node1: your host and reload your VM.
|
---|
724 | node1:
|
---|
725 | node1: Guest Additions Version: 5.1.38
|
---|
726 | node1: VirtualBox Version: 5.2
|
---|
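Because the host ports in the forwarding log above are shifted by one, it can help to look up the host port for a given guest port rather than guess. A sketch that parses lines in the format shown (`8088 (guest) => 8089 (host)`); `host_port_for` is a hypothetical helper:

```shell
# Hypothetical helper: read Vagrant port-forwarding lines on stdin and
# print the host port mapped to the given guest port.
host_port_for() {
    guest=$1
    awk -v g="$guest" '
        {
            for (i = 1; i + 3 <= NF; i++) {
                if ($i == g && $(i + 1) == "(guest)" && $(i + 2) == "=>") {
                    print $(i + 3)
                    exit
                }
            }
        }'
}

# Example use (assumes `vagrant port` prints lines in the same layout):
#   vagrant port node1 | host_port_for 4040
```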
727 |
|
---|
728 | ------------
|
---|