source: gs3-extensions/maori-lang-detection/hdfs-cc-work/GS_README.TXT@ 33539

----------------------------------------
INDEX: follow in sequence
----------------------------------------
A. VAGRANT VM WITH HADOOP AND SPARK
B. Create IAM role on Amazon AWS to use S3a
C. Configure Spark on your vagrant VM with the AWS authentication details
---
The script scripts/setup.sh now automates the steps in D-F below
and prints out the main instruction for G.
---
D. OPTIONAL? Further configuration for Hadoop to work with Amazon AWS
E. Setup cc-index-table git project
F. Setup warc-to-wet tools (git projects)
G. Getting and running our scripts
----------------------------------------

----------------------------------------
A. VAGRANT VM WITH HADOOP AND SPARK
----------------------------------------
Set up vagrant with hadoop and spark as follows:

1. Follow the instructions at
https://github.com/martinprobson/vagrant-hadoop-hive-spark

This will eventually create the following folder, which will contain the Vagrantfile:
/home/<USER>/vagrant-hadoop-hive-spark

2. If there are other vagrant VMs set up according to the same instructions on the same machine, then you need to change the forwarded ports (the 2nd column of ports) in the file "Vagrantfile". In the example below, excerpted from my Vagrantfile, I've incremented the forwarded ports by 1:

    config.vm.network "forwarded_port", guest: 8080, host: 8081
    config.vm.network "forwarded_port", guest: 8088, host: 8089
    config.vm.network "forwarded_port", guest: 9083, host: 9084
    config.vm.network "forwarded_port", guest: 4040, host: 4041
    config.vm.network "forwarded_port", guest: 18888, host: 18889
    config.vm.network "forwarded_port", guest: 16010, host: 16011

Remember to visit the adjusted ports on the running VM.

3. The most useful vagrant commands:
vagrant up        # start up the vagrant VM if not already running.
                  # May need to provide the VM's ID if there's more than one vagrant VM
vagrant ssh       # ssh into the sole vagrant VM, else may need to provide the vagrant VM's ID

vagrant halt      # shut down the vagrant VM. Provide the VM's ID if there's more than one vagrant VM.

(vagrant destroy) # get rid of your vagrant VM. Useful if you've edited your Vagrantfile
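
To find a VM's ID when more than one vagrant VM exists on the host, vagrant's global-status command lists every VM along with its id, provider, state and directory; the listed id can then be passed to up/ssh/halt/destroy:

vagrant global-status    # list all vagrant VMs known to this host machine
vagrant halt <id>        # e.g. act on a specific VM by its listed id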


4. Inside the VM, /home/<USER>/vagrant-hadoop-hive-spark will be shared and mounted as /vagrant
Remember, this is the folder containing the Vagrantfile. It's easy to use the shared folder to transfer files between the VM and the actual machine that hosts it.
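
For example (hypothetical file name), to move a file from the host machine into the VM via the shared folder:

    # on the host machine:
    cp some-file.txt /home/<USER>/vagrant-hadoop-hive-spark/
    # then inside the vagrant VM it appears under the mount point:
    ls /vagrant/some-file.txt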

5. Install EMACS, FIREFOX AND MAVEN on the vagrant VM:
Start up the vagrant machine ("vagrant up") and ssh into it ("vagrant ssh") if you haven't already.


a. sudo apt-get -y install firefox

b. sudo apt-get install emacs

c. sudo apt-get install maven
   (or: sudo apt update
        sudo apt install maven)

Maven is needed for the commoncrawl github projects we'll be working with.


6. Although you can edit the Vagrantfile to have emacs and maven installed automatically when the vagrant VM is created, for firefox you're advised to install it manually as above.

To be able to view firefox from the machine hosting the VM, use a separate terminal and run:
    vagrant ssh -- -Y
[or "vagrant ssh -- -Y node1", if VM ID is node1]
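
Once inside that X-forwarded ssh session, a quick sanity check (not part of the original setup steps) that the display is being forwarded:

    echo $DISPLAY    # should print something like localhost:10.0 when X forwarding is active
    firefox &        # the firefox window should then appear on the machine hosting the VM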

READING ON Vagrant:
 * Guide: https://www.vagrantup.com/intro/getting-started/index.html
 * Common cmds: https://blog.ipswitch.com/5-vagrant-commands-you-need-to-know
 * vagrant reload = vagrant halt + vagrant up: https://www.vagrantup.com/docs/cli/reload.html
 * https://stackoverflow.com/questions/46903623/how-to-use-firefox-ui-in-vagrant-box
 * https://stackoverflow.com/questions/22651399/how-to-install-firefox-in-precise64-vagrant-box
   sudo apt-get -y install firefox
 * vagrant install emacs: https://medium.com/@AnnaJS15/getting-started-with-virtualbox-and-vagrant-8d98aa271d2a

 * hadoop conf: sudo vi /usr/local/hadoop-2.7.6/etc/hadoop/mapred-site.xml
 * https://data-flair.training/forums/topic/mkdir-cannot-create-directory-data-name-node-is-in-safe-mode/
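
Related to the safe-mode link above: if hdfs commands fail with "Name node is in safe mode", the standard hadoop admin commands to check and leave safe mode (run inside the VM) are:

    hdfs dfsadmin -safemode get     # report whether the namenode is currently in safe mode
    hdfs dfsadmin -safemode leave   # force the namenode out of safe mode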

-------------------------------------------------
B. Create IAM role on Amazon AWS to use S3 (S3a)
-------------------------------------------------
CommonCrawl (CC) crawl data is stored on Amazon S3 and accessed over s3a, the newest of the S3 connectors, which has superseded both the original s3 connector and its successor s3n.

In order to have access to CC crawl data, you need to create an IAM role on Dr Bainbridge's Amazon AWS account and configure its profile for commoncrawl.

1. Log into Dr Bainbridge's Amazon AWS account
- In the aws management console:
[email protected]
lab pwd, capital R and ! (maybe g)


2. Create a new "iam" role or user for the "commoncrawl(er)" profile

3. You can create the commoncrawl profile while creating the user/role, by following the instructions at https://answers.dataiku.com/1734/common-crawl-s3
which state:

"Even though the bucket is public, if your AWS key does not have your full permissions (ie if it's a restricted IAM user), you need to grant explicit access to the commoncrawl bucket: attach the following policy to your IAM user"

#### START POLICY IN JSON FORMAT ###
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Stmt1503647467000",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::commoncrawl/*",
                "arn:aws:s3:::commoncrawl"
            ]
        }
    ]
}
#### END POLICY ###
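
If you prefer the command line to the web console, the same policy can be attached with the AWS CLI along these lines (a sketch only: it assumes the JSON above was saved as commoncrawl-policy.json and that the IAM user was named commoncrawl):

    aws iam put-user-policy --user-name commoncrawl \
        --policy-name CommonCrawlReadAccess \
        --policy-document file://commoncrawl-policy.json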


--------------------------------------------------------------------------
C. Configure Spark on your vagrant VM with the AWS authentication details
--------------------------------------------------------------------------
Any Spark jobs run against the CommonCrawl data stored on Amazon S3 (s3a) need to be able to authenticate with the AWS IAM role you created above. To do this, put the Amazon AWS access key and secret key in the SPARK configuration properties file. (Put them there rather than in hadoop's core-site.xml, because in the latter case the authentication details don't get copied across to the other computers in the distributed cluster when distributed jobs are run, and those machines also need to know how to authenticate.)

1. Inside the vagrant VM:

    sudo emacs /usr/local/spark-2.3.0-bin-hadoop2.7/conf/spark-defaults.conf
    (sudo emacs $SPARK_HOME/conf/spark-defaults.conf)

2. Edit the spark properties conf file to contain these 3 new properties:

    spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
    spark.hadoop.fs.s3a.access.key=PASTE_IAM-ROLE_ACCESSKEY_HERE
    spark.hadoop.fs.s3a.secret.key=PASTE_IAM-ROLE_SECRETKEY_HERE

Instructions on which properties to set were taken from:
- https://stackoverflow.com/questions/30385981/how-to-access-s3a-files-from-apache-spark
- https://community.cloudera.com/t5/Community-Articles/HDP-2-4-0-and-Spark-1-6-0-connecting-to-AWS-S3-buckets/ta-p/245760

[NOTE, inactive alternative: Instead of editing spark's config file to set these properties, they can also be set in the bash script that executes the commoncrawl Spark jobs:

    $SPARK_HOME/bin/spark-submit \
        ...
        --conf spark.hadoop.fs.s3a.access.key=ACCESSKEY \
        --conf spark.hadoop.fs.s3a.secret.key=SECRETKEY \
        --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
        ...

But it's better not to hardcode authentication details into code, so I did it the first way.
]
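
As an optional sanity check that the access key and secret work against the public commoncrawl bucket, you can try a listing from inside the VM (assuming hadoop can already find the s3a connector jars, see section D; the crawl id below is just an example):

    hadoop fs -D fs.s3a.access.key=ACCESSKEY -D fs.s3a.secret.key=SECRETKEY \
        -ls s3a://commoncrawl/crawl-data/CC-MAIN-2019-51/
    # a successful listing of the crawl's segment folders means s3a authentication is working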

----------------------------------------------------------------------
NOTE:
The script scripts/setup.sh now automates the steps in D-F below
and prints out the main instruction for G.


----------------------------------------------------------------------
D. OPTIONAL? Further configuration for Hadoop to work with Amazon AWS
----------------------------------------------------------------------
The following 2 pages state that additional steps are necessary to get hadoop and spark to work with AWS S3a:

- https://stackoverflow.com/questions/30385981/how-to-access-s3a-files-from-apache-spark
- https://community.cloudera.com/t5/Community-Articles/HDP-2-4-0-and-Spark-1-6-0-connecting-to-AWS-S3-buckets/ta-p/245760

I'm not sure whether these steps were really necessary in my case, and if so, whether it was A or B below that got things working for me. However, I have both A and B below set up.


A. Check your maven installation for the necessary jars:

1. Installing maven may already have fetched the specifically recommended version of the AWS Java SDK (aws-java-sdk-1.7.4.jar) and the hadoop-aws jar matching the vagrant VM's hadoop version (hadoop-aws-2.7.6.jar). Check these locations, as that's where I have them:
- /home/vagrant/.m2/repository/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar
- /home/vagrant/.m2/repository/org/apache/hadoop/hadoop-aws/2.7.6/hadoop-aws-2.7.6.jar

The specifically recommended v1.7.4 from the instructions can be found off https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk/1.7.4 at https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar
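
If either jar is missing from the local maven repository, maven's dependency:get goal should be able to pull down these exact versions into ~/.m2 (an optional shortcut, not part of the original steps):

    mvn dependency:get -Dartifact=com.amazonaws:aws-java-sdk:1.7.4
    mvn dependency:get -Dartifact=org.apache.hadoop:hadoop-aws:2.7.6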

2. The script that runs the 2 Spark jobs uses the above paths for one of the spark jobs:
    $SPARK_HOME/bin/spark-submit \
        --jars file:/home/vagrant/.m2/repository/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar,file:/home/vagrant/.m2/repository/org/apache/hadoop/hadoop-aws/2.7.6/hadoop-aws-2.7.6.jar \
        --driver-class-path=/home/vagrant/.m2/repository/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar:/home/vagrant/.m2/repository/org/apache/hadoop/hadoop-aws/2.7.6/hadoop-aws-2.7.6.jar \

However, the other Spark job in the script does not set --jars or --driver-class-path, despite also referring to the s3a://commoncrawl table. So I'm not sure whether the jars are necessary or whether they were just being ignored when provided.

B. Download jar files and put them on the hadoop classpath:

1. Download the jar files:
- I obtained aws-java-sdk-1.11.616.jar (v1.11) from https://aws.amazon.com/sdk-for-java/

- I downloaded the hadoop-aws 2.7.6 jar, as it goes with my version of hadoop, from https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/2.7.6

2. The easiest solution is to copy the 2 downloaded jars to a location on the hadoop classpath.

a. The command that shows the paths present on the Hadoop CLASSPATH:
    hadoop classpath
One of the paths this will list is /usr/local/hadoop-2.7.6/share/hadoop/common/

b. SUDO COPY the 2 downloaded jar files, hadoop-aws-2.7.6.jar and aws-java-sdk-1.11.616.jar, to this location:

    sudo cp hadoop-aws-2.7.6.jar /usr/local/hadoop-2.7.6/share/hadoop/common/.
    sudo cp aws-java-sdk-1.11.616.jar /usr/local/hadoop-2.7.6/share/hadoop/common/.

Any hadoop jobs run will now find these 2 jar files on the classpath.
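
To confirm the copied jars are now visible to hadoop, an optional check is to expand the classpath (hadoop classpath --glob expands the wildcard entries that plain hadoop classpath leaves as-is) and grep for them:

    hadoop classpath --glob | tr ':' '\n' | grep -iE 'aws-java-sdk|hadoop-aws'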

[NOTE, unused alternative: Instead of copying the 2 jar files into a system location, assuming they were downloaded into /home/vagrant/lib, you can also export a custom folder's jar files onto the hadoop classpath from the bash script that runs the spark jobs. This had no effect for me, and was commented out, and is another reason why I'm not sure if the 2 jar files were even necessary.
#export LIBJARS=/home/vagrant/lib/*
#export HADOOP_CLASSPATH=`echo ${LIBJARS} | sed s/,/:/g`
]


------------------------------------
E. Setup cc-index-table git project
------------------------------------
Need to be inside the vagrant VM.

1. Since you should have already installed maven, you can check out and compile the cc-index-table git project:

    git clone https://github.com/commoncrawl/cc-index-table.git

2. Modify the top-level pom.xml file used by maven by changing the spark version used to 2.3.0 and adding a dependency for hadoop-aws 2.7.6, as indicated below:

17c17,18
<     <spark.version>2.4.1</spark.version>
---
>     <!--<spark.version>2.4.1</spark.version>-->
>     <spark.version>2.3.0</spark.version>
135a137,143
>     <dependency>
>       <groupId>org.apache.hadoop</groupId>
>       <artifactId>hadoop-aws</artifactId>
>       <version>2.7.6</version>
>     </dependency>
>
3. Although cc-index-table will compile successfully after the above modifications, it will nevertheless throw an exception when it's eventually run. To fix that, edit the file "cc-index-table/src/main/java/org/commoncrawl/spark/examples/CCIndexWarcExport.java" as follows:

a. Set option("header") to false, since the csv file contains no header row, only data rows.
   Change:
      sqlDF = sparkSession.read().format("csv").option("header", true).option("inferSchema", true)
          .load(csvQueryResult);
   To:
      sqlDF = sparkSession.read().format("csv").option("header", false).option("inferSchema", true)
          .load(csvQueryResult);

b. The 4 column names are then inferred as _c0 to _c3, not as url/warc_filename etc.
   Comment out:
      JavaRDD<Row> rdd = sqlDF.select("url", "warc_filename", "warc_record_offset", "warc_record_length").rdd()
          .toJavaRDD();
   Replace with the default inferred column names:
      JavaRDD<Row> rdd = sqlDF.select("_c0", "_c1", "_c2", "_c3").rdd()
          .toJavaRDD();

// TODO: link svn committed versions of orig and modified CCIndexWarcExport.java here.

4. Now (re)compile cc-index-table with the above modifications:

    cd cc-index-table
    mvn package
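
After mvn package finishes, the build output lands under cc-index-table/target/. A quick check (the exact artifact file name depends on the project's version at the time you build) is:

    ls cc-index-table/target/*.jar    # the assembled jar that the spark jobs get submitted with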

-------------------------------
F. Setup warc-to-wet tools
-------------------------------
To convert WARC files to WET (.warc.wet) files, you need to check out, set up and compile a couple more tools. These instructions are derived from those at https://groups.google.com/forum/#!topic/common-crawl/hsb90GHq6to

1. Grab and compile the 2 git projects for converting warc to wet:
    git clone https://github.com/commoncrawl/ia-web-commons
    cd ia-web-commons
    mvn install

    git clone https://github.com/commoncrawl/ia-hadoop-tools
    cd ia-hadoop-tools
    # can't compile this yet


2. Add the following into ia-hadoop-tools/pom.xml, at the top of the <dependencies> element, so that maven finds an appropriate version of the org.json package and its JSONTokener (version number found off https://mvnrepository.com/artifact/org.json/json):

    <dependency>
      <groupId>org.json</groupId>
      <artifactId>json</artifactId>
      <version>20131018</version>
    </dependency>

[
  UNFAMILIAR CHANGES that I don't recollect making and that may have been a consequence of the change in step 1 above:
  a. These further differences show up between the original version of the file in pom.xml.orig and the modified new pom.xml:
     ia-hadoop-tools>diff pom.xml.orig pom.xml

     <     <groupId>org.netpreserve.commons</groupId>
     <     <artifactId>webarchive-commons</artifactId>
     <     <version>1.1.1-SNAPSHOT</version>
     ---
     >     <groupId>org.commoncrawl</groupId>
     >     <artifactId>ia-web-commons</artifactId>
     >     <version>1.1.9-SNAPSHOT</version>

  b. I don't recollect changing or manually introducing any java files. I just followed the instructions at https://groups.google.com/forum/#!topic/common-crawl/hsb90GHq6to

     However, running diff -rq between the latest "ia-hadoop-tools" git project, checked out a month after the "ia-hadoop-tools.orig" checkout, shows the following differences in files which github itself does not show as recently modified in that same period.

     ia-hadoop-tools> diff -rq ia-hadoop-tools ia-hadoop-tools.orig/
     Files ia-hadoop-tools/src/main/java/org/archive/hadoop/io/HDFSTouch.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/io/HDFSTouch.java differ
     Only in ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs: ArchiveFileExtractor.java
     Files ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/CDXGenerator.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs/CDXGenerator.java differ
     Files ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/JobDriver.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs/JobDriver.java differ
     Files ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/WARCMetadataRecordGenerator.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs/WARCMetadataRecordGenerator.java differ
     Files ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/WATGenerator.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs/WATGenerator.java differ
     Only in ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs: WEATGenerator.java
     Files ia-hadoop-tools/src/main/java/org/archive/hadoop/mapreduce/ZipNumPartitioner.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/mapreduce/ZipNumPartitioner.java differ
     Files ia-hadoop-tools/src/main/java/org/archive/hadoop/pig/ZipNumRecordReader.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/pig/ZipNumRecordReader.java differ
     Files ia-hadoop-tools/src/main/java/org/archive/hadoop/streaming/ZipNumRecordReader.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/streaming/ZipNumRecordReader.java differ
     Files ia-hadoop-tools/src/main/java/org/archive/server/FileBackedInputStream.java and ia-hadoop-tools.orig/src/main/java/org/archive/server/FileBackedInputStream.java differ
     Files ia-hadoop-tools/src/main/java/org/archive/server/GZRangeClient.java and ia-hadoop-tools.orig/src/main/java/org/archive/server/GZRangeClient.java differ
     Files ia-hadoop-tools/src/main/java/org/archive/server/GZRangeServer.java and ia-hadoop-tools.orig/src/main/java/org/archive/server/GZRangeServer.java differ
]

3. Now ia-hadoop-tools can be compiled:
    cd ia-hadoop-tools
    mvn package

4. It can't be run until guava.jar is on the hadoop classpath. Locate a guava.jar and put it into an existing location checked by hadoop classpath:

    locate guava.jar
    # found in /usr/share/java/guava.jar and /usr/share/maven/lib/guava.jar
    diff /usr/share/java/guava.jar /usr/share/maven/lib/guava.jar
    # identical/no difference, so can use either
    sudo cp /usr/share/java/guava.jar /usr/local/hadoop/share/hadoop/common/.
    # now guava.jar has been copied into a location on the hadoop classpath


Having done the above, our bash script will now be able to convert WARC to WET files when it runs:
    $HADOOP_MAPRED_HOME/bin/hadoop jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -strictMode -skipExisting batch-id-xyz hdfs:///user/vagrant/PATH/TO/warc/*.warc.gz
Our script expects a specific folder structure: there should be a "warc" folder (containing the warc files), which is supplied as above, but also an empty "wet" and "wat" folder at the same level as the "warc" folder.
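
For example (hypothetical crawl id), that layout can be created in HDFS with:

    hdfs dfs -mkdir -p /user/vagrant/CC-MAIN-2019-51/warc
    hdfs dfs -mkdir /user/vagrant/CC-MAIN-2019-51/wet
    hdfs dfs -mkdir /user/vagrant/CC-MAIN-2019-51/wat
    # the downloaded .warc.gz files go under "warc"; "wet" and "wat" start out empty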


When the job is running, you can visit the Spark Context at http://node1:4040/jobs/ (for me it was http://node1:4041/jobs/ the first time, since I had forwarded the vagrant VM's ports at +1; on subsequent runs, however, it seemed to be on node1:4040/jobs).

-----------------------------------
G. Getting and running our scripts
-----------------------------------

1. Grab our 1st bash script and put it into /home/vagrant/cc-index-table/src/script:
    cd cc-index-table/src/script
    wget http://svn.greenstone.org/gs3-extensions/maori-lang-detection/hdfs-cc-work/scripts/get_maori_WET_records_for_crawl.sh
    chmod u+x get_maori_WET_records_for_crawl.sh

RUN AS:
    cd cc-index-table
    ./src/script/get_maori_WET_records_for_crawl.sh <crawl-timestamp>
    where crawl-timestamp is of the form "CC-MAIN-YYYY-##" >= September 2019
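
For instance, to process the December 2019 crawl (one of the crawl ids listed at http://index.commoncrawl.org/):

    cd cc-index-table
    ./src/script/get_maori_WET_records_for_crawl.sh CC-MAIN-2019-51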

OUTPUT:
After hours of processing (leave it to run overnight), you should end up with:
    hdfs dfs -ls /user/vagrant/<crawl-timestamp>
In particular, the zipped wet records at hdfs:///user/vagrant/<crawl-timestamp>/wet/
that we want would have been copied into /vagrant/<crawl-timestamp>-wet-files/


The script get_maori_WET_records_for_crawl.sh
- takes a crawl timestamp of the form "CC-MAIN-YYYY-##" from Sep 2018 onwards (before which content_languages were not indexed). The legitimate crawl timestamps are listed in the first column at http://index.commoncrawl.org/
- runs a spark job against CC's AWS bucket over s3a to create a csv table of MRI language records
- runs a spark job to download all the WARC records from CC's AWS that are denoted by the csv file's records into zipped warc files
- converts WARC to WET: locally converts the downloaded warc.gz files into warc.wet.gz (and warc.wat.gz) files


2. Grab our 2nd bash script and put it into the top level of cc-index-table (/home/vagrant/cc-index-table):

    cd cc-index-table
    wget http://svn.greenstone.org/gs3-extensions/maori-lang-detection/hdfs-cc-work/scripts/get_Maori_WET_records_from_CCSep2018_on.sh
    chmod u+x get_Maori_WET_records_from_CCSep2018_on.sh

RUN FROM THE cc-index-table DIRECTORY AS:
    (cd cc-index-table)
    ./get_Maori_WET_records_from_CCSep2018_on.sh

This script just runs the 1st script, cc-index-table/src/script/get_maori_WET_records_for_crawl.sh (above), to process all listed common-crawls since September 2018.
If any crawl fails, the script will terminate. Else it runs against each common-crawl in sequence.

NOTE: If needed, update the script with more recent crawl timestamps from http://index.commoncrawl.org/

OUTPUT:
After days of running, you will end up with:
    hdfs:///user/vagrant/<crawl-timestamp>/wet/
for each crawl-timestamp listed in the script,
which at present would have been copied into
    /vagrant/<crawl-timestamp>-wet-files/

Each of these output wet folders can then be processed in turn by CCWETProcessor.java from http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java


-----------------------EOF------------------------
