source: gs3-extensions/maori-lang-detection/hdfs-cc-work/Readme.txt@ 33530

Last change on this file since 33530 was 33530, checked in by ak19, 5 years ago
Completed sentence that was left hanging.
File size: 20.8 KB

1	----------------------------------------
2	INDEX: follow in sequence
3	----------------------------------------
4	A. VAGRANT VM WITH HADOOP AND SPARK
5	B. Create IAM role on Amazon AWS to use S3a
6	C. Configure Spark on your vagrant VM with the AWS authentication details
7	D. OPTIONAL? Further configuration for Hadoop to work with Amazon AWS
8	E. Setup cc-index-table git project
9	F. Setup warc-to-wet tools (git projects)
10	G. Getting and running our scripts
11	----------------------------------------
12
13	----------------------------------------
14	A. VAGRANT VM WITH HADOOP AND SPARK
15	----------------------------------------
16	Set up vagrant with hadoop and spark as follows:
17
18	1. Follow the instructions at
19	https://github.com/martinprobson/vagrant-hadoop-hive-spark
20
21	This will eventually create the following folder, which will contain Vagrantfile
22	/home/<USER>/vagrant-hadoop-hive-spark
23
24	2. If other vagrant VMs have been set up on the same machine following the same instructions, you need to change the forwarded host ports (the 2nd column of ports) in the file "Vagrantfile". In the example below, excerpted from my Vagrantfile, I've incremented the forwarded ports by 1:
25
26	config.vm.network "forwarded_port", guest: 8080, host: 8081
27	config.vm.network "forwarded_port", guest: 8088, host: 8089
28	config.vm.network "forwarded_port", guest: 9083, host: 9084
29	config.vm.network "forwarded_port", guest: 4040, host: 4041
30	config.vm.network "forwarded_port", guest: 18888, host: 18889
31	config.vm.network "forwarded_port", guest: 16010, host: 16011
32
33	Remember to use the adjusted (forwarded) ports when visiting the running VM's web pages from the host machine.
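A minimal sketch, not part of the original setup, for generating the same six forwarding lines with a different offset when yet another VM is added. The function name forwarded_ports and its offset argument are made up for illustration. The guest ports are the ones listed above.

```bash
#!/usr/bin/env bash
# Print Vagrantfile port-forwarding lines with every host port shifted
# by a fixed offset. forwarded_ports is a hypothetical helper.
forwarded_ports() {
  local offset="${1:-1}"   # how far to shift the host ports (assumed parameter)
  local guest
  for guest in 8080 8088 9083 4040 18888 16010; do
    printf 'config.vm.network "forwarded_port", guest: %d, host: %d\n' \
      "$guest" "$((guest + offset))"
  done
}

# Offset 1 reproduces the excerpt from my Vagrantfile shown above.
forwarded_ports 1
```

The output can be pasted into the Vagrantfile in place of the existing forwarded_port lines.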
34
35	3. The most useful vagrant commands:
36	vagrant up # start up the vagrant VM if not already running.
37	# May need to provide VM's ID if there's more than one vagrant VM
38	vagrant ssh # ssh into the sole vagrant VM, else may need to provide vagrant VM's ID
39
40	vagrant halt # to shutdown the vagrant VM. Provide VM's ID if there's more than one vagrant VM.
41
42	(vagrant destroy) # to get rid of your vagrant VM. Useful if you've edited your Vagrantfile
43
44
45	4. Inside the VM, /home/<USER>/vagrant-hadoop-hive-spark will be shared and mounted as /vagrant
46	Remember, this is the folder containing Vagrantfile. It's easy to use the shared folder to transfer files between the VM and the actual machine that hosts it.
47
48	5. Install EMACS, FIREFOX AND MAVEN on the vagrant VM:
49	Start up the vagrant machine ("vagrant up") and ssh into it ("vagrant ssh") if you haven't already.
50
51
52	a. sudo apt-get -y install firefox
53
54	b. sudo apt-get install emacs
55
56	c. sudo apt-get install maven
57	(or sudo apt update
58	sudo apt install maven)
59
60	Maven is needed for the commoncrawl github projects we'll be working with.
61
62
63	6. Although you can edit the Vagrantfile to have emacs and maven automatically installed when the vagrant VM is created, for firefox, you're advised to install it as above.
64
65	To be able to view firefox from the machine hosting the VM, use a separate terminal and run:
66	vagrant ssh -- -Y
67	[or "vagrant ssh -- -Y node1", if VM ID is node1]
68
69	READING ON Vagrant:
70	* Guide: https://www.vagrantup.com/intro/getting-started/index.html
71	* Common cmds: https://blog.ipswitch.com/5-vagrant-commands-you-need-to-know
72	* vagrant reload = vagrant halt + vagrant up https://www.vagrantup.com/docs/cli/reload.html
73	* https://stackoverflow.com/questions/46903623/how-to-use-firefox-ui-in-vagrant-box
74	* https://stackoverflow.com/questions/22651399/how-to-install-firefox-in-precise64-vagrant-box
75	sudo apt-get -y install firefox
76	* vagrant install emacs: https://medium.com/@AnnaJS15/getting-started-with-virtualbox-and-vagrant-8d98aa271d2a
77
78	* hadoop conf: sudo vi /usr/local/hadoop-2.7.6/etc/hadoop/mapred-site.xml
79	* https://data-flair.training/forums/topic/mkdir-cannot-create-directory-data-name-node-is-in-safe-mode/
80
81	-------------------------------------------------
82	B. Create IAM role on Amazon AWS to use S3 (S3a)
83	-------------------------------------------------
84	CommonCrawl (CC) crawl data is stored on Amazon S3 and is accessed here over s3a, the newest Hadoop connector for S3, which has superseded both s3 and its earlier successor s3n.
85
86	In order to have access to CC crawl data, you need to create an IAM role on Dr Bainbridge's Amazon AWS account and configure its profile for commoncrawl.
87
88	1. Log into Dr Bainbridge's Amazon AWS account
89	- In the aws management console:
90	[email protected]
91	lab pwd, capital R and ! (maybe g)
92
93
94	2. Create a new "iam" role or user for "commoncrawl(er)" profile
95
96	3. You can create the commoncrawl profile while creating the user/role, by following the instructions at https://answers.dataiku.com/1734/common-crawl-s3
97	which states
98
99	"Even though the bucket is public, if your AWS key does not have your full permissions (ie if it's a restricted IAM user), you need to grant explicit access to the commoncrawl bucket: attach the following policy to your IAM user"
100
101	#### START POLICY IN JSON FORMAT ###
102	{
103	"Version": "2012-10-17",
104	"Statement": [
105	{
106	"Sid": "Stmt1503647467000",
107	"Effect": "Allow",
108	"Action": [
109	"s3:GetObject",
110	"s3:ListBucket"
111	],
112	"Resource": [
113	"arn:aws:s3:::commoncrawl/*",
114	"arn:aws:s3:::commoncrawl"
115	]
116	}
117	]
118	}
119	#### END POLICY ###
120
121
122	--------------------------------------------------------------------------
123	C. Configure Spark on your vagrant VM with the AWS authentication details
124	--------------------------------------------------------------------------
125	Any Spark jobs run against the CommonCrawl data stored on Amazon S3 (accessed over s3a) need to be able to authenticate with the AWS IAM role you created above. To do this, put the Amazon AWS access key and secret key in the SPARK configuration properties file. (Don't configure these values in hadoop's core-site.xml instead: in that case, the authentication details don't get copied across when distributed jobs are run on the other computers in the cluster, which also need to know how to authenticate.)
126
127	1. Inside the vagrant vm:
128
129	sudo emacs /usr/local/spark-2.3.0-bin-hadoop2.7/conf/spark-defaults.conf
130	(sudo emacs $SPARK_HOME/conf/spark-defaults.conf)
131
132	2. Edit the spark properties conf file to contain these 3 new properties:
133
134	spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
135	spark.hadoop.fs.s3a.access.key=PASTE_IAM-ROLE_ACCESSKEY_HERE
136	spark.hadoop.fs.s3a.secret.key=PASTE_IAM-ROLE_SECRETKEY_HERE
137
138	Instructions on which properties to set were taken from:
139	- https://stackoverflow.com/questions/30385981/how-to-access-s3a-files-from-apache-spark
140	- https://community.cloudera.com/t5/Community-Articles/HDP-2-4-0-and-Spark-1-6-0-connecting-to-AWS-S3-buckets/ta-p/245760
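If you'd rather script step 2 than edit the file by hand, the following is a rough sketch. set_spark_prop is a hypothetical helper (not part of Spark) that sets or replaces one key=value line in a properties file. Taking the keys from the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables is my assumption, not something the original setup did.

```bash
#!/usr/bin/env bash
# Set (or replace) one key=value line in a properties file.
# set_spark_prop is a hypothetical helper, not part of Spark.
set_spark_prop() {
  local file="$1" key="$2" value="$3" tmp
  touch "$file"
  tmp="$(mktemp)"
  # Keep every line that does not start with "key=", then append the new value.
  awk -v k="${key}=" 'index($0, k) != 1' "$file" > "$tmp"
  printf '%s=%s\n' "$key" "$value" >> "$tmp"
  cat "$tmp" > "$file"   # overwrite in place so file ownership/permissions are kept
  rm -f "$tmp"
}

# Usage (assumes the conf file is writable by you, e.g. run as root, and
# that the two AWS_* variables hold the IAM access key and secret key):
# conf="$SPARK_HOME/conf/spark-defaults.conf"
# set_spark_prop "$conf" spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
# set_spark_prop "$conf" spark.hadoop.fs.s3a.access.key "$AWS_ACCESS_KEY_ID"
# set_spark_prop "$conf" spark.hadoop.fs.s3a.secret.key "$AWS_SECRET_ACCESS_KEY"
```

Re-running it with a new key replaces the old line rather than adding a duplicate.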
141
142	[NOTE, inactive alternative: Instead of editing spark's config file to set these properties, these properties can also be set in the bash script that executes the commoncrawl Spark jobs:
143
144	$SPARK_HOME/bin/spark-submit \
145	...
146	--conf spark.hadoop.fs.s3a.access.key=ACCESSKEY \
147	--conf spark.hadoop.fs.s3a.secret.key=SECRETKEY \
148	--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
149	...
150
151	But better not to hardcode authentication details into code, so I did it the first way.
152	]
153
154
155	----------------------------------------------------------------------
156	D. OPTIONAL? Further configuration for Hadoop to work with Amazon AWS
157	----------------------------------------------------------------------
158	The following 2 pages state that additional steps are necessary to get hadoop and spark to work with AWS S3a:
159
160	- https://stackoverflow.com/questions/30385981/how-to-access-s3a-files-from-apache-spark
161	- https://community.cloudera.com/t5/Community-Articles/HDP-2-4-0-and-Spark-1-6-0-connecting-to-AWS-S3-buckets/ta-p/245760
162
163	I'm not sure whether these steps were really necessary in my case, and if so, whether it was A or B below that got things working for me. However, I have both A and B below set up.
164
165
166	A. Check your maven installation for necessary jars:
167
168	1. Installing maven may already have got the specifically recommended version of AWS-Java-SDK (aws-java-sdk-1.7.4.jar) and v2.7.6 hadoop-aws matching the vagrant VM's hadoop version (hadoop-aws-2.7.6.jar). Check these locations, as that's where I have them:
169	- /home/vagrant/.m2/repository/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar
170	- /home/vagrant/.m2/repository/org/apache/hadoop/hadoop-aws/2.7.6/hadoop-aws-2.7.6.jar
171
172	The specifically recommended v.1.7.4 from the instructions can be found off https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk/1.7.4 at https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar
173
174	2. The script that runs the 2 Spark jobs uses the above paths for one of the spark jobs:
175	$SPARK_HOME/bin/spark-submit \
176	--jars file:/home/vagrant/.m2/repository/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar,file:/home/vagrant/.m2/repository/org/apache/hadoop/hadoop-aws/2.7.6/hadoop-aws-2.7.6.jar \
177	--driver-class-path=/home/vagrant/.m2/repository/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar:/home/vagrant/.m2/repository/org/apache/hadoop/hadoop-aws/2.7.6/hadoop-aws-2.7.6.jar \
178
179	However, the other Spark job in the script does not set --jars or --driver-class-path, despite also referring to the s3a://commoncrawl table. So I'm not sure whether the jars are necessary or whether they were just being ignored when provided.
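A quick way to check whether the two jars from A.1 are where I have them could look like the sketch below. check_jars and the M2 variable are hypothetical names. The paths are the ones given in A.1.

```bash
#!/usr/bin/env bash
# Report whether each given jar path exists.
# Returns non-zero if any of them is missing. check_jars is a hypothetical helper.
check_jars() {
  local jar missing=0
  for jar in "$@"; do
    if [ -f "$jar" ]; then
      echo "found:   $jar"
    else
      echo "missing: $jar"
      missing=1
    fi
  done
  return "$missing"
}

# M2 (assumed variable) points at the local maven repository from A.1.
M2="${M2:-/home/vagrant/.m2/repository}"
check_jars \
  "$M2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar" \
  "$M2/org/apache/hadoop/hadoop-aws/2.7.6/hadoop-aws-2.7.6.jar" || true
```

If either jar is reported missing, the --jars and --driver-class-path values in A.2 would need adjusting to wherever the jars actually are.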
180
181	B. Download jar files and put them on the hadoop classpath:
182
183	1. download the jar files:
184	- I obtained aws-java-sdk-1.11.616.jar (v1.11) from https://aws.amazon.com/sdk-for-java/
185
186	- I downloaded hadoop-aws 2.7.6 jar, as it goes with my version of hadoop, from https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/2.7.6
187
188	2. The easiest solution is to copy the 2 downloaded jars onto a location in the hadoop classpath.
189
190	a. The command that shows the paths present on the Hadoop CLASSPATH:
191	hadoop classpath
192	One of the paths this will list is /usr/local/hadoop-2.7.6/share/hadoop/common/
193
194	b. SUDO COPY the 2 downloaded jar files, hadoop-aws-2.7.6.jar and aws-java-sdk-1.11.616.jar, to this location:
195
196	sudo cp hadoop-aws-2.7.6.jar /usr/local/hadoop-2.7.6/share/hadoop/common/.
197	sudo cp aws-java-sdk-1.11.616.jar /usr/local/hadoop-2.7.6/share/hadoop/common/.
198
199	Any hadoop jobs run will now find these 2 jar files on the classpath.
200
201	[NOTE, unused alternative: Instead of copying the 2 jar files into a system location, assuming they were downloaded into /home/vagrant/lib, you can also export a custom folder's jar files into the hadoop classpath from the bash script that runs the spark jobs. This had no effect for me, and was commented out, and is another reason why I'm not sure if the 2 jar files were even necessary.
202	#export LIBJARS=/home/vagrant/lib/*
203	#export HADOOP_CLASSPATH=`echo ${LIBJARS} | sed s/,/:/g`
204	]
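To confirm that a folder really is on the Hadoop classpath (as in B.2.a) before copying jars into it, a sketch along these lines may help. on_classpath is a hypothetical helper. It only does string matching on the output of "hadoop classpath" and assumes that output is a colon-separated list whose entries may end in "/" or "/*".

```bash
#!/usr/bin/env bash
# Succeed if directory $2 is an entry in the colon-separated classpath
# string $1, ignoring a trailing "/" or "/*" on either side.
# on_classpath is a hypothetical helper.
on_classpath() {
  local classpath="$1" dir="${2%/}" entry
  local -a entries
  IFS=':' read -r -a entries <<< "$classpath"
  for entry in "${entries[@]}"; do
    entry="${entry%/\*}"   # strip a trailing /* wildcard
    entry="${entry%/}"     # strip a trailing slash
    [ "$entry" = "$dir" ] && return 0
  done
  return 1
}

# Only attempted when the hadoop command is available (i.e. inside the VM).
if command -v hadoop >/dev/null 2>&1; then
  if on_classpath "$(hadoop classpath)" /usr/local/hadoop-2.7.6/share/hadoop/common; then
    echo "on classpath"
  else
    echo "not on classpath"
  fi
fi
```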
205
206
207	------------------------------------
208	E. Setup cc-index-table git project
209	------------------------------------
210	You need to be inside the vagrant VM for this.
211
212	1. Since you should have already installed maven, you can checkout and compile the cc-index-table git project.
213
214	git clone https://github.com/commoncrawl/cc-index-table.git
215
216	2. Modify the toplevel pom.xml file used by maven by changing the spark version used to 2.3.0 and adding a dependency for hadoop-aws 2.7.6, as indicated below:
217
218	17c17,18
219	< <spark.version>2.4.1</spark.version>
220	---
221	> <!--<spark.version>2.4.1</spark.version>-->
222	> <spark.version>2.3.0</spark.version>
223	135a137,143
224	> <dependency>
225	> <groupId>org.apache.hadoop</groupId>
226	> <artifactId>hadoop-aws</artifactId>
227	> <version>2.7.6</version>
228	> </dependency>
229	>
230
231	3. Although cc-index-table will compile successfully after the above modifications, it will nevertheless throw an exception when it's eventually run. To fix that, edit the file "cc-index-table/src/main/java/org/commoncrawl/spark/examples/CCIndexWarcExport.java" as follows:
232
233	a. Set option("header") to false, since the csv file contains no header row, only data rows.
234	Change:
235	sqlDF = sparkSession.read().format("csv").option("header", true).option("inferSchema", true)
236	.load(csvQueryResult);
237	To
238	sqlDF = sparkSession.read().format("csv").option("header", false).option("inferSchema", true)
239	.load(csvQueryResult);
240
241	b. The 4 column names are inferred as _c0 to _c3, not as url/warc_filename etc.
242	Comment out:
243	//JavaRDD<Row> rdd = sqlDF.select("url", "warc_filename", "warc_record_offset", "warc_record_length").rdd()
244	//.toJavaRDD();
245	Replace with the default inferred column names:
246	JavaRDD<Row> rdd = sqlDF.select("_c0", "_c1", "_c2", "_c3").rdd()
247	.toJavaRDD();
248
249	// TODO: link svn committed versions of orig and modified CCIndexWarcExport.java here.
250
251	4. Now (re)compile cc-index-table with the above modifications:
252
253	cd cc-index-table
254	mvn package
255
256	-------------------------------
257	F. Setup warc-to-wet tools
258	-------------------------------
259	To convert WARC files to WET (.warc.wet) files, you need to check out, set up and compile a couple more tools. These instructions are derived from those at https://groups.google.com/forum/#!topic/common-crawl/hsb90GHq6to
260
261	1. Grab and compile the 2 git projects for converting warc to wet:
262	git clone https://github.com/commoncrawl/ia-web-commons
263	cd ia-web-commons
264	mvn install
265	cd ..
266	git clone https://github.com/commoncrawl/ia-hadoop-tools
267	cd ia-hadoop-tools
268	# can't compile this yet
269
270
271	2. Add the following into ia-hadoop-tools/pom.xml, in the top of the <dependencies> element, so that maven finds an appropriate version of the org.json package and its JSONTokener (version number found off https://mvnrepository.com/artifact/org.json/json):
272
273	<dependency>
274	<groupId>org.json</groupId>
275	<artifactId>json</artifactId>
276	<version>20131018</version>
277	</dependency>
278
279	[
280	UNFAMILIAR CHANGES that I don't recollect making and that may have been a consequence of the change in step 1 above:
281	a. These further differences show up between the original version of the file in pom.xml.orig and the modified new pom.xml:
282	ia-hadoop-tools>diff pom.xml.orig pom.xml
283
284	< <groupId>org.netpreserve.commons</groupId>
285	< <artifactId>webarchive-commons</artifactId>
286	< <version>1.1.1-SNAPSHOT</version>
287	---
288	> <groupId>org.commoncrawl</groupId>
289	> <artifactId>ia-web-commons</artifactId>
290	> <version>1.1.9-SNAPSHOT</version>
291
292	b. I don't recollect changing or manually introducing any java files. I just followed the instructions at https://groups.google.com/forum/#!topic/common-crawl/hsb90GHq6to
293
294	However, a diff -rq between the latest "ia-hadoop-tools" git project, checked out a month after my original "ia-hadoop-tools.orig" checkout, and that original checkout shows the following differences, in files that github itself does not show as modified during that period.
295
296	ia-hadoop-tools> diff -rq ia-hadoop-tools ia-hadoop-tools.orig/
297	Files ia-hadoop-tools/src/main/java/org/archive/hadoop/io/HDFSTouch.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/io/HDFSTouch.java differ
298	Only in ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs: ArchiveFileExtractor.java
299	Files ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/CDXGenerator.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs/CDXGenerator.java differ
300	Files ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/JobDriver.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs/JobDriver.java differ
301	Files ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/WARCMetadataRecordGenerator.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs/WARCMetadataRecordGenerator.java differ
302	Files ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/WATGenerator.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs/WATGenerator.java differ
303	Only in ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs: WEATGenerator.java
304	Files ia-hadoop-tools/src/main/java/org/archive/hadoop/mapreduce/ZipNumPartitioner.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/mapreduce/ZipNumPartitioner.java differ
305	Files ia-hadoop-tools/src/main/java/org/archive/hadoop/pig/ZipNumRecordReader.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/pig/ZipNumRecordReader.java differ
306	Files ia-hadoop-tools/src/main/java/org/archive/hadoop/streaming/ZipNumRecordReader.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/streaming/ZipNumRecordReader.java differ
307	Files ia-hadoop-tools/src/main/java/org/archive/server/FileBackedInputStream.java and ia-hadoop-tools.orig/src/main/java/org/archive/server/FileBackedInputStream.java differ
308	Files ia-hadoop-tools/src/main/java/org/archive/server/GZRangeClient.java and ia-hadoop-tools.orig/src/main/java/org/archive/server/GZRangeClient.java differ
309	Files ia-hadoop-tools/src/main/java/org/archive/server/GZRangeServer.java and ia-hadoop-tools.orig/src/main/java/org/archive/server/GZRangeServer.java differ
310	]
311
312	3. Now you can compile ia-hadoop-tools:
313	cd ia-hadoop-tools
314	mvn package
315
316	4. It can't be run until guava.jar is on the hadoop classpath. Locate a guava.jar and copy it into an existing location on the hadoop classpath:
317
318	locate guava.jar
319	# found in /usr/share/java/guava.jar and /usr/share/maven/lib/guava.jar
320	diff /usr/share/java/guava.jar /usr/share/maven/lib/guava.jar
321	# identical/no difference, so can use either
322	sudo cp /usr/share/java/guava.jar /usr/local/hadoop/share/hadoop/common/.
323	# now guava.jar has been copied into a location on hadoop classpath
324
325
326	Having done the above, our bash script will now be able to convert WARC to WET files when it runs:
327	$HADOOP_MAPRED_HOME/bin/hadoop jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -strictMode -skipExisting batch-id-xyz hdfs:///user/vagrant/PATH/TO/warc/*.warc.gz
328	Our script expects a specific folder structure: there should be a "warc" folder (containing the warc files), which is supplied as above, but also an empty "wet" and "wat" folder at the same level as the "warc" folder.
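In case the three sibling folders need creating by hand, here is a small sketch that prints the hdfs commands for doing so. mkdir_cmds is a hypothetical helper, and /user/vagrant/PATH/TO is the same placeholder as in the command above, to be replaced with your own path.

```bash
#!/usr/bin/env bash
# Print the hdfs commands that would create the warc, wet and wat
# folders side by side under a base path. mkdir_cmds is a hypothetical helper.
mkdir_cmds() {
  local base="${1%/}" sub
  for sub in warc wet wat; do
    echo "hdfs dfs -mkdir -p $base/$sub"
  done
}

# Dry run: only prints the commands.
mkdir_cmds /user/vagrant/PATH/TO
# To actually run them inside the VM:  mkdir_cmds /user/vagrant/PATH/TO | bash
```

Printing rather than executing keeps the sketch safe to try outside the VM, where hdfs is not available.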
329
330
331	When the job is running, you can visit the Spark Context at http://node1:4040/jobs/ (the first time, this was http://node1:4041/jobs/ for me, since I forwarded the vagrant VM's ports at +1. However, subsequent times it was on node1:4040/jobs?)
332
333	-----------------------------------
334	G. Getting and running our scripts
335	-----------------------------------
336
337	1. Grab our 1st bash script and put it into /home/vagrant/cc-index-table/src/script:
338	cd cc-index-table/src/script
339	wget http://svn.greenstone.org/gs3-extensions/maori-lang-detection/bin/script/get_maori_WET_records_for_crawl.sh
340	chmod u+x get_maori_WET_records_for_crawl.sh
341
342	RUN AS:
343	cd cc-index-table
344	./src/script/get_maori_WET_records_for_crawl.sh <crawl-timestamp>
345	where crawl-timestamp is of the form "CC-MAIN-YYYY-##", for crawls from September 2018 onwards
346
347	OUTPUT:
348	After hours of processing (leave it to run overnight), you should end up with:
349	hdfs dfs -ls /user/vagrant/<crawl-timestamp>
350	In particular, the zipped wet records at hdfs:///user/vagrant/<crawl-timestamp>/wet/
351	that we want would have been copied into /vagrant/<crawl-timestamp>-wet-files/
352
353
354	The script get_maori_WET_records_for_crawl.sh
355	- takes a crawl timestamp of the form "CC-MAIN-YYYY-##" from Sep 2018 onwards (before which content_languages were not indexed). The legitimate crawl timestamps are listed in the first column at http://index.commoncrawl.org/
356	- runs a spark job against CC's AWS bucket over s3a to create a csv table of MRI language records
357	- runs a spark job to download all the WARC records from CC's AWS that are denoted by the csv file's records into zipped warc files
358	- converts WARC to WET: locally converts the downloaded warc.gz files into warc.wet.gz (and warc.wat.gz) files
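Before kicking off a run that takes hours, the crawl timestamp argument can be sanity-checked with something like this sketch. valid_crawl_id is a hypothetical helper. It checks only the "CC-MAIN-YYYY-##" shape and that the year is 2018 or later. It does not check the ## part against the Sep 2018 cutoff, nor whether the crawl is actually listed at http://index.commoncrawl.org/

```bash
#!/usr/bin/env bash
# Succeed if $1 looks like CC-MAIN-YYYY-## with a year of 2018 or later.
# Format and year check only. valid_crawl_id is a hypothetical helper.
valid_crawl_id() {
  local id="$1"
  [[ "$id" =~ ^CC-MAIN-([0-9]{4})-([0-9]{2})$ ]] || return 1
  [ "${BASH_REMATCH[1]}" -ge 2018 ]
}

# Example values chosen only to exercise the check.
for id in CC-MAIN-2019-35 CC-MAIN-2017-04 bogus; do
  if valid_crawl_id "$id"; then
    echo "$id: ok"
  else
    echo "$id: rejected"
  fi
done
```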
359
360
361	2. Grab our 2nd bash script and put it into the top level of the vagrant VM (/home/vagrant):
362
363	cd /home/vagrant
364	wget http://svn.greenstone.org/gs3-extensions/maori-lang-detection/bin/script/get_Maori_WET_records_from_CCSep2018_on.sh
365	chmod u+x get_Maori_WET_records_from_CCSep2018_on.sh
366
367	RUN AS:
368	./get_Maori_WET_records_from_CCSep2018_on.sh
369
370	This script just runs the 1st script cc-index-table/src/script/get_maori_WET_records_for_crawl.sh (above) to process all listed common-crawls since September 2018.
371	If any run fails, the script terminates. Otherwise it runs against each common-crawl in sequence.
372
373	NOTE: If needed, update the script with more recent crawl timestamps from http://index.commoncrawl.org/
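The run-in-sequence, stop-on-first-failure behaviour described above can be sketched as follows. This is not the contents of get_Maori_WET_records_from_CCSep2018_on.sh, only an illustration of the pattern. run_crawls is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Run a command once per crawl ID, in order, stopping at the first failure.
# run_crawls is a hypothetical helper; $1 is the per-crawl command.
run_crawls() {
  local cmd="$1" id
  shift
  for id in "$@"; do
    echo "processing $id"
    if ! "$cmd" "$id"; then
      echo "failed on $id, stopping" >&2
      return 1
    fi
  done
}

# Usage inside the VM (take the crawl timestamps from the index page):
# cd /home/vagrant/cc-index-table
# run_crawls ./src/script/get_maori_WET_records_for_crawl.sh <crawl-timestamp> <crawl-timestamp>
```

Stopping at the first failure means a later crawl is never started on top of a broken earlier run.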
374
375	OUTPUT:
376	After days of running, you will end up with:
377	hdfs:///user/vagrant/<crawl-timestamp>/wet/
378	for each crawl-timestamp listed in the script,
379	which at present would have got copied into
380	/vagrant/<crawl-timestamp>-wet-files/
381
382	Each of these output wet folders can then be processed in turn by CCWETProcessor.java from http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java
383
384
385	-----------------------EOF------------------------
386
