source: other-projects/maori-lang-detection/hdfs-cc-work/GS_README.TXT@ 33809

Last change on this file since 33809 was 33809, checked in by ak19, 4 years ago

Some more GS_README.txt instructions. Haven't put the mongodb queries in here yet; they're still in MoreReading/mongodb.txt, but the final queries that are useful will end up in this file later on.

----------------------------------------
INDEX: follow in sequence
----------------------------------------
A. VAGRANT VM WITH HADOOP AND SPARK
B. Create IAM role on Amazon AWS to use S3a
C. Configure Spark on your vagrant VM with the AWS authentication details
---
The script scripts/setup.sh now automates the steps in D-F below
and prints out the main instruction for G.
---
D. OPTIONAL? Further configuration for Hadoop to work with Amazon AWS
E. Setup cc-index-table git project
F. Setup warc-to-wet tools (git projects)
G. Getting and running our scripts
---
H. Autistici's crawl - CLI to download web sites as WARCs, with basic features to avoid crawler traps
I. Setting up Nutch v2 on its own Vagrant VM machine
J. Automated crawling with Nutch v2.3.1 and post-processing
K. Sending the crawled data into mongodb with NutchTextDumpProcessor.java
---

APPENDIX: Reading data from hbase tables and backing up hbase

----------------------------------------

----------------------------------------
A. VAGRANT VM WITH HADOOP AND SPARK
----------------------------------------
Set up vagrant with hadoop and spark as follows:

1. Follow the instructions at
https://github.com/martinprobson/vagrant-hadoop-hive-spark

This will eventually create the following folder, which will contain the Vagrantfile:
/home/<USER>/vagrant-hadoop-hive-spark

2. If there are other vagrant VMs set up according to the same instructions on the same machine, then you need to change the forwarded ports (the 2nd column of ports) in the file "Vagrantfile". In the example below, excerpted from my Vagrantfile, I've incremented the forwarded ports by 1:

  config.vm.network "forwarded_port", guest: 8080, host: 8081
  config.vm.network "forwarded_port", guest: 8088, host: 8089
  config.vm.network "forwarded_port", guest: 9083, host: 9084
  config.vm.network "forwarded_port", guest: 4040, host: 4041
  config.vm.network "forwarded_port", guest: 18888, host: 18889
  config.vm.network "forwarded_port", guest: 16010, host: 16011

Remember to visit the adjusted ports when accessing the running VM.

3. The most useful vagrant commands:
vagrant up        # start up the vagrant VM if not already running.
                  # May need to provide the VM's ID if there's more than one vagrant VM
vagrant ssh       # ssh into the sole vagrant VM, else may need to provide the vagrant VM's ID

vagrant halt      # to shut down the vagrant VM. Provide the VM's ID if there's more than one vagrant VM.

(vagrant destroy) # to get rid of your vagrant VM. Useful if you've edited your Vagrantfile


4. Inside the VM, /home/<USER>/vagrant-hadoop-hive-spark will be shared and mounted as /vagrant
Remember, this is the folder containing the Vagrantfile. It's easy to use the shared folder to transfer files between the VM and the actual machine that hosts it.

5. Install EMACS, FIREFOX AND MAVEN on the vagrant VM:
Start up the vagrant machine ("vagrant up") and ssh into it ("vagrant ssh") if you haven't already.


a. sudo apt-get -y install firefox

b. sudo apt-get install emacs

c. sudo apt-get install maven
   (or sudo apt update
       sudo apt install maven)

Maven is needed for the commoncrawl github projects we'll be working with.


6. Although you can edit the Vagrantfile to have emacs and maven automatically installed when the vagrant VM is created, firefox is best installed manually as above.

To be able to view firefox from the machine hosting the VM, use a separate terminal and run:
    vagrant ssh -- -Y
[or "vagrant ssh -- -Y node1", if VM ID is node1]

READING ON Vagrant:
 * Guide: https://www.vagrantup.com/intro/getting-started/index.html
 * Common cmds: https://blog.ipswitch.com/5-vagrant-commands-you-need-to-know
 * vagrant reload = vagrant halt + vagrant up: https://www.vagrantup.com/docs/cli/reload.html
 * https://stackoverflow.com/questions/46903623/how-to-use-firefox-ui-in-vagrant-box
 * https://stackoverflow.com/questions/22651399/how-to-install-firefox-in-precise64-vagrant-box
     sudo apt-get -y install firefox
 * vagrant install emacs: https://medium.com/@AnnaJS15/getting-started-with-virtualbox-and-vagrant-8d98aa271d2a

 * hadoop conf: sudo vi /usr/local/hadoop-2.7.6/etc/hadoop/mapred-site.xml
 * https://data-flair.training/forums/topic/mkdir-cannot-create-directory-data-name-node-is-in-safe-mode/

-------------------------------------------------
B. Create IAM role on Amazon AWS to use S3 (S3a)
-------------------------------------------------
CommonCrawl (CC) crawl data is stored on Amazon S3 and is accessed over s3a, the newest S3 connector, which supersedes both the original s3 connector and its successor s3n.

In order to have access to CC crawl data, you need to create an IAM role on Dr Bainbridge's Amazon AWS account and configure its profile for commoncrawl.

1. Log into Dr Bainbridge's Amazon AWS account
- In the aws management console:
[email protected]
lab pwd, capital R and ! (maybe g)


2. Create a new "iam" role or user for the "commoncrawl(er)" profile

3. You can create the commoncrawl profile while creating the user/role, by following the instructions at https://answers.dataiku.com/1734/common-crawl-s3
which state:

"Even though the bucket is public, if your AWS key does not have your full permissions (ie if it's a restricted IAM user), you need to grant explicit access to the commoncrawl bucket: attach the following policy to your IAM user"

#### START POLICY IN JSON FORMAT ###
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1503647467000",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::commoncrawl/*",
        "arn:aws:s3:::commoncrawl"
      ]
    }
  ]
}
#### END POLICY ###

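If you prefer the command line over the AWS web console, the same policy can be attached with the AWS CLI. This is only a sketch, not part of the original setup: it assumes the AWS CLI is installed and configured with credentials allowed to administer IAM, that the JSON above is saved as commoncrawl-policy.json, and that the IAM user is called "commoncrawler".

    # attach the read-only commoncrawl policy to the IAM user
    aws iam put-user-policy --user-name commoncrawler \
        --policy-name CommonCrawlReadAccess \
        --policy-document file://commoncrawl-policy.json

    # confirm the policy is now listed against the user
    aws iam list-user-policies --user-name commoncrawler
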

--------------------------------------------------------------------------
C. Configure Spark on your vagrant VM with the AWS authentication details
--------------------------------------------------------------------------
Any Spark jobs run against the CommonCrawl data stored on Amazon s3a need to be able to authenticate with the AWS IAM role you created above. To do this, put the Amazon AWS access key and secret key in the Spark configuration properties file, rather than in hadoop's core-site.xml: in the latter case, the authentication details do not get copied across to the other computers in the distributed cluster when distributed jobs are run, even though those computers also need to know how to authenticate.

1. Inside the vagrant VM:

    sudo emacs /usr/local/spark-2.3.0-bin-hadoop2.7/conf/spark-defaults.conf
    (sudo emacs $SPARK_HOME/conf/spark-defaults.conf)

2. Edit the spark properties conf file to contain these 3 new properties:

    spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
    spark.hadoop.fs.s3a.access.key=PASTE_IAM-ROLE_ACCESSKEY_HERE
    spark.hadoop.fs.s3a.secret.key=PASTE_IAM-ROLE_SECRETKEY_HERE

Instructions on which properties to set were taken from:
- https://stackoverflow.com/questions/30385981/how-to-access-s3a-files-from-apache-spark
- https://community.cloudera.com/t5/Community-Articles/HDP-2-4-0-and-Spark-1-6-0-connecting-to-AWS-S3-buckets/ta-p/245760

[NOTE, inactive alternative: instead of editing spark's config file, these properties can also be set in the bash script that executes the commoncrawl Spark jobs:

$SPARK_HOME/bin/spark-submit \
    ...
    --conf spark.hadoop.fs.s3a.access.key=ACCESSKEY \
    --conf spark.hadoop.fs.s3a.secret.key=SECRETKEY \
    --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
    ...

But it's better not to hardcode authentication details into code, so I did it the first way.
]

----------------------------------------------------------------------
NOTE:
The script scripts/setup.sh now automates the steps in D-F below
and prints out the main instruction for G.


----------------------------------------------------------------------
D. OPTIONAL? Further configuration for Hadoop to work with Amazon AWS
----------------------------------------------------------------------
The following 2 pages state that additional steps are necessary to get hadoop and spark to work with AWS S3a:

- https://stackoverflow.com/questions/30385981/how-to-access-s3a-files-from-apache-spark
- https://community.cloudera.com/t5/Community-Articles/HDP-2-4-0-and-Spark-1-6-0-connecting-to-AWS-S3-buckets/ta-p/245760

I'm not sure whether these steps were really necessary in my case, and if so, whether it was A or B below that got things working for me. However, I have both A and B below set up.


A. Check your maven installation for necessary jars:

1. Installing maven may already have fetched the specifically recommended version of the AWS Java SDK (aws-java-sdk-1.7.4.jar) and the hadoop-aws jar matching the vagrant VM's hadoop version (hadoop-aws-2.7.6.jar). Check these locations, as that's where I have them:
- /home/vagrant/.m2/repository/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar
- /home/vagrant/.m2/repository/org/apache/hadoop/hadoop-aws/2.7.6/hadoop-aws-2.7.6.jar

The specifically recommended v1.7.4 from the instructions can be found via https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk/1.7.4 at https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar

2. The script that runs the 2 Spark jobs uses the above paths for one of the spark jobs:
    $SPARK_HOME/bin/spark-submit \
        --jars file:/home/vagrant/.m2/repository/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar,file:/home/vagrant/.m2/repository/org/apache/hadoop/hadoop-aws/2.7.6/hadoop-aws-2.7.6.jar \
        --driver-class-path=/home/vagrant/.m2/repository/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar:/home/vagrant/.m2/repository/org/apache/hadoop/hadoop-aws/2.7.6/hadoop-aws-2.7.6.jar \

However, the other Spark job in the script does not set --jars or --driver-class-path, despite also referring to the s3a://commoncrawl table. So I'm not sure whether the jars are necessary or whether they were just being ignored when provided.

B. Download jar files and put them on the hadoop classpath:

1. Download the jar files:
- I obtained aws-java-sdk-1.11.616.jar (v1.11) from https://aws.amazon.com/sdk-for-java/

- I downloaded the hadoop-aws 2.7.6 jar, as it goes with my version of hadoop, from https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/2.7.6

2. The easiest solution is to copy the 2 downloaded jars into a location on the hadoop classpath.

a. The following command shows the paths present on the Hadoop CLASSPATH:
    hadoop classpath
One of the paths this will list is /usr/local/hadoop-2.7.6/share/hadoop/common/

b. SUDO COPY the 2 downloaded jar files, hadoop-aws-2.7.6.jar and aws-java-sdk-1.11.616.jar, to this location:

    sudo cp hadoop-aws-2.7.6.jar /usr/local/hadoop-2.7.6/share/hadoop/common/.
    sudo cp aws-java-sdk-1.11.616.jar /usr/local/hadoop-2.7.6/share/hadoop/common/.

Any hadoop jobs run will now find these 2 jar files on the classpath.
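
At this point a quick sanity check (a sketch, not part of the original instructions) can confirm both that the jars are visible and that the IAM credentials from sections B-C actually grant read access to the commoncrawl bucket. Substitute the real access and secret keys:

    # the two jars should be listed in the hadoop common directory
    ls /usr/local/hadoop-2.7.6/share/hadoop/common/ | grep -E 'aws-java-sdk|hadoop-aws'

    # try listing a public commoncrawl prefix over s3a
    hadoop fs -D fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
              -D fs.s3a.access.key=PASTE_IAM-ROLE_ACCESSKEY_HERE \
              -D fs.s3a.secret.key=PASTE_IAM-ROLE_SECRETKEY_HERE \
              -ls s3a://commoncrawl/crawl-data/ | head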

[NOTE, unused alternative: instead of copying the 2 jar files into a system location, and assuming they were downloaded into /home/vagrant/lib, you can also export a custom folder's jar files onto the hadoop classpath from the bash script that runs the spark jobs. This had no effect for me and was commented out, which is another reason why I'm not sure whether the 2 jar files were even necessary.
#export LIBJARS=/home/vagrant/lib/*
#export HADOOP_CLASSPATH=`echo ${LIBJARS} | sed s/,/:/g`
]


------------------------------------
E. Setup cc-index-table git project
------------------------------------
You need to be inside the vagrant VM.

1. Since you should have already installed maven, you can check out and compile the cc-index-table git project.

    git clone https://github.com/commoncrawl/cc-index-table.git

2. Modify the top-level pom.xml used by maven: change the spark version to 2.3.0 and add a dependency on hadoop-aws 2.7.6, as shown in the diff below:

17c17,18
<     <spark.version>2.4.1</spark.version>
---
>     <!--<spark.version>2.4.1</spark.version>-->
>     <spark.version>2.3.0</spark.version>
135a137,143
>     <dependency>
>       <groupId>org.apache.hadoop</groupId>
>       <artifactId>hadoop-aws</artifactId>
>       <version>2.7.6</version>
>     </dependency>
>

3. Although cc-index-table will compile successfully after the above modifications, it will nevertheless throw an exception when it's eventually run. To fix that, edit the file "cc-index-table/src/main/java/org/commoncrawl/spark/examples/CCIndexWarcExport.java" as follows:

a. Set option("header") to false, since the csv file contains no header row, only data rows.
   Change:
        sqlDF = sparkSession.read().format("csv").option("header", true).option("inferSchema", true)
                .load(csvQueryResult);
   To:
        sqlDF = sparkSession.read().format("csv").option("header", false).option("inferSchema", true)
                .load(csvQueryResult);

b. The 4 column names are inferred as _c0 to _c3, not as url/warc_filename etc.
   Comment out:
        //JavaRDD<Row> rdd = sqlDF.select("url", "warc_filename", "warc_record_offset", "warc_record_length").rdd()
        //        .toJavaRDD();
   Replace with the default inferred column names:
        JavaRDD<Row> rdd = sqlDF.select("_c0", "_c1", "_c2", "_c3").rdd()
                .toJavaRDD();

// TODO: link svn committed versions of orig and modified CCIndexWarcExport.java here.

4. Now (re)compile cc-index-table with the above modifications:

    cd cc-index-table
    mvn package
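
A quick check (a sketch, not part of the original instructions) that the build produced a jar under target/ for the section G scripts to submit to spark; the exact jar filename depends on the version in pom.xml:

    ls target/*.jar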

-------------------------------
F. Setup warc-to-wet tools
-------------------------------
To convert WARC files to WET (.warc.wet) files, you need to check out, set up and compile a couple more tools. These instructions are derived from those at https://groups.google.com/forum/#!topic/common-crawl/hsb90GHq6to

1. Grab and compile the 2 git projects for converting warc to wet:
    git clone https://github.com/commoncrawl/ia-web-commons
    cd ia-web-commons
    mvn install

    git clone https://github.com/commoncrawl/ia-hadoop-tools
    cd ia-hadoop-tools
    # can't compile this yet


2. Add the following into ia-hadoop-tools/pom.xml, at the top of the <dependencies> element, so that maven finds an appropriate version of the org.json package and its JSONTokener (version number found via https://mvnrepository.com/artifact/org.json/json):

    <dependency>
      <groupId>org.json</groupId>
      <artifactId>json</artifactId>
      <version>20131018</version>
    </dependency>

[
  UNFAMILIAR CHANGES that I don't recollect making and that may have been a consequence of the change in step 1 above:
  a. These further differences show up between the original version of the file in pom.xml.orig and the modified new pom.xml:
     ia-hadoop-tools>diff pom.xml.orig pom.xml

     <     <groupId>org.netpreserve.commons</groupId>
     <     <artifactId>webarchive-commons</artifactId>
     <     <version>1.1.1-SNAPSHOT</version>
     ---
     >     <groupId>org.commoncrawl</groupId>
     >     <artifactId>ia-web-commons</artifactId>
     >     <version>1.1.9-SNAPSHOT</version>

  b. I don't recollect changing or manually introducing any java files. I just followed the instructions at https://groups.google.com/forum/#!topic/common-crawl/hsb90GHq6to

     However, a diff -rq between the latest "ia-hadoop-tools" git project (checked out a month after the "ia-hadoop-tools.orig" checkout) shows the following differences, in files which github itself does not show as having been modified in that same period.

     ia-hadoop-tools> diff -rq ia-hadoop-tools ia-hadoop-tools.orig/
     Files ia-hadoop-tools/src/main/java/org/archive/hadoop/io/HDFSTouch.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/io/HDFSTouch.java differ
     Only in ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs: ArchiveFileExtractor.java
     Files ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/CDXGenerator.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs/CDXGenerator.java differ
     Files ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/JobDriver.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs/JobDriver.java differ
     Files ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/WARCMetadataRecordGenerator.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs/WARCMetadataRecordGenerator.java differ
     Files ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/WATGenerator.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs/WATGenerator.java differ
     Only in ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs: WEATGenerator.java
     Files ia-hadoop-tools/src/main/java/org/archive/hadoop/mapreduce/ZipNumPartitioner.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/mapreduce/ZipNumPartitioner.java differ
     Files ia-hadoop-tools/src/main/java/org/archive/hadoop/pig/ZipNumRecordReader.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/pig/ZipNumRecordReader.java differ
     Files ia-hadoop-tools/src/main/java/org/archive/hadoop/streaming/ZipNumRecordReader.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/streaming/ZipNumRecordReader.java differ
     Files ia-hadoop-tools/src/main/java/org/archive/server/FileBackedInputStream.java and ia-hadoop-tools.orig/src/main/java/org/archive/server/FileBackedInputStream.java differ
     Files ia-hadoop-tools/src/main/java/org/archive/server/GZRangeClient.java and ia-hadoop-tools.orig/src/main/java/org/archive/server/GZRangeClient.java differ
     Files ia-hadoop-tools/src/main/java/org/archive/server/GZRangeServer.java and ia-hadoop-tools.orig/src/main/java/org/archive/server/GZRangeServer.java differ
]

3. Now you can compile ia-hadoop-tools:
    cd ia-hadoop-tools
    mvn package

4. It can't be run until guava.jar is on the hadoop classpath. Locate a guava.jar and put it into one of the existing locations checked by "hadoop classpath":

    locate guava.jar
    # found in /usr/share/java/guava.jar and /usr/share/maven/lib/guava.jar
    diff /usr/share/java/guava.jar /usr/share/maven/lib/guava.jar
    # identical/no difference, so can use either
    sudo cp /usr/share/java/guava.jar /usr/local/hadoop/share/hadoop/common/.
    # now guava.jar has been copied into a location on the hadoop classpath


Having done the above, our bash script will now be able to convert WARC to WET files when it runs:
    $HADOOP_MAPRED_HOME/bin/hadoop jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -strictMode -skipExisting batch-id-xyz hdfs:///user/vagrant/PATH/TO/warc/*.warc.gz
Our script expects a specific folder structure: there should be a "warc" folder (containing the warc files), which is supplied as above, but also an empty "wet" and "wat" folder at the same level as the "warc" folder.
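
For example, a layout like the following satisfies that expectation (a sketch only; substitute your own path for PATH/TO):

    hdfs dfs -mkdir -p hdfs:///user/vagrant/PATH/TO/warc
    hdfs dfs -mkdir hdfs:///user/vagrant/PATH/TO/wet
    hdfs dfs -mkdir hdfs:///user/vagrant/PATH/TO/wat
    # the *.warc.gz files go under warc/; wet/ and wat/ start out empty
    hdfs dfs -ls hdfs:///user/vagrant/PATH/TO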


When the job is running, you can visit the Spark context web UI at http://node1:4040/jobs/ (for me it was http://node1:4041/jobs/ the first time, since I had forwarded the vagrant VM's ports at +1; on subsequent runs it was back on node1:4040/jobs).

-----------------------------------
G. Getting and running our scripts
-----------------------------------

1. Grab our 1st bash script and put it into /home/vagrant/cc-index-table/src/script:
    cd cc-index-table/src/script
    wget http://svn.greenstone.org/gs3-extensions/maori-lang-detection/hdfs-cc-work/scripts/get_maori_WET_records_for_crawl.sh
    chmod u+x get_maori_WET_records_for_crawl.sh

RUN AS:
    cd cc-index-table
    ./src/script/get_maori_WET_records_for_crawl.sh <crawl-timestamp>
        where crawl-timestamp is of the form "CC-MAIN-YYYY-##" and is from September 2018 or later

OUTPUT:
After hours of processing (leave it to run overnight), you should end up with:
    hdfs dfs -ls /user/vagrant/<crawl-timestamp>
In particular, the zipped wet records that we want, at hdfs:///user/vagrant/<crawl-timestamp>/wet/,
would have been copied into /vagrant/<crawl-timestamp>-wet-files/


The script get_maori_WET_records_for_crawl.sh
- takes a crawl timestamp of the form "CC-MAIN-YYYY-##" from Sep 2018 onwards (before which content_languages were not indexed). The legitimate crawl timestamps are listed in the first column at http://index.commoncrawl.org/
- runs a spark job against CC's AWS bucket over s3a to create a csv table of MRI language records
- runs a spark job to download all the WARC records from CC's AWS that are denoted by the csv file's records into zipped warc files
- converts WARC to WET: locally converts the downloaded warc.gz files into warc.wet.gz (and warc.wat.gz) files
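
As a spot check after one crawl has been processed, compare what is on hdfs with what was copied into the shared folder (a sketch; the crawl id CC-MAIN-2019-35 is just an example):

    hdfs dfs -ls /user/vagrant/CC-MAIN-2019-35/wet/ | head
    ls /vagrant/CC-MAIN-2019-35-wet-files/ | head
    # both listings should show the same zipped wet records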


2. Grab our 2nd bash script and put it into the top level of cc-index-table (/home/vagrant/cc-index-table):

    cd cc-index-table
    wget http://svn.greenstone.org/gs3-extensions/maori-lang-detection/hdfs-cc-work/scripts/get_Maori_WET_records_from_CCSep2018_on.sh
    chmod u+x get_Maori_WET_records_from_CCSep2018_on.sh

RUN FROM THE cc-index-table DIRECTORY AS:
    (cd cc-index-table)
    ./get_Maori_WET_records_from_CCSep2018_on.sh

This script just runs the 1st script, cc-index-table/src/script/get_maori_WET_records_for_crawl.sh (above), to process all listed common-crawls since September 2018.
If any crawl fails, the script will terminate; otherwise it runs against each common-crawl in sequence.

NOTE: If needed, update the script with more recent crawl timestamps from http://index.commoncrawl.org/

OUTPUT:
After days of running, you will end up with:
    hdfs:///user/vagrant/<crawl-timestamp>/wet/
for each crawl-timestamp listed in the script,
which at present would have got copied into
    /vagrant/<crawl-timestamp>-wet-files/

Each of these output wet folders can then be processed in turn by CCWETProcessor.java from http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java

-----------------------------------
H. Autistici crawl
-----------------------------------
Autistici's crawl: a CLI to download web sites as WARCs, with basic features to avoid crawler traps.

Of the various software options for site mirroring, Autistici's "crawl" seemed promising:
https://anarc.at/services/archive/web/

- CLI.
- Can download a website quite simply, though flags for additional settings are available.
- Coded to prevent common traps.
- Downloads a website as a WARC file.
- I now have the WARC to WET process working for the WARC file it produced for the usual test site (Dr Bainbridge's home page).

You need to have Go installed in order to install and run Autistici's crawl.
Not a problem, because I can do it on the remote machine (which also hosts the hdfs) where I have sudo powers.

INSTRUCTIONS

1. Install go 1.11 by following the instructions at https://medium.com/better-programming/install-go-1-11-on-ubuntu-18-04-16-04-lts-8c098c503c5f
2. Create the go environment (for example by putting the following in a small script that you source, or in your shell profile):
    #!/bin/bash
    # environment vars for golang
    export GOROOT=/usr/local/go
    export GOPATH=$HOME/go
    export PATH=$GOPATH/bin:$GOROOT/bin:$PATH
3. The installation instructions in https://git.autistici.org/ale/crawl/README.md are not very clear and don't work as is at this stage.

These steps work:

    cd $GOPATH
    mkdir bin
    mkdir src
    cd src

4. Since trying to "go install" the crawl URL directly didn't work (see
https://stackoverflow.com/questions/14416275/error-cant-load-package-package-my-prog-found-packages-my-prog-and-main
and https://stackoverflow.com/questions/26694271/go-install-doesnt-create-any-bin-file),
clone the repository into the expected package path instead:

vagrant@node2:~/go/src$
    mkdir -p git.autistici.org/ale
    cd git.autistici.org/ale
    git clone https://git.autistici.org/ale/crawl.git

[Now you can run the install command from README.md:]
    cd $GOPATH/src
    go install git.autistici.org/ale/crawl/cmd/crawl

Now we should have a $GOPATH/bin folder containing the "crawl" binary.
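
A quick check (a sketch) that the install step put the binary where expected:

    ls -l $GOPATH/bin/crawl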

5. Run a crawl:
    cd $GOPATH/bin
    ./crawl https://www.cs.waikato.ac.nz/~davidb/

which downloads the site and puts the warc file into the $GOPATH/bin folder.

More options (such as the output folder and a WARC filename pattern, so that the multiple warc files created for a huge site follow the same naming pattern) are described in the instructions in README.md.

6. To view the RAW contents of a WARC file:
https://github.com/ArchiveTeam/grab-site/blob/master/README.md#viewing-the-content-in-your-warc-archives

    zless <warc-file-name>

zless is already installed on the vagrant VM.


-----------------------------------------------------------------------------------------------
How to run warc-to-wet conversion on sites downloaded as WARCs by Autistici's "crawl"
-----------------------------------------------------------------------------------------------
ISSUES CONVERTING WARC to WET:
---
WARC files produced by Autistici's crawl are in a somewhat different format to CommonCrawl WARCs:
- missing elements in the header
- different header elements
- different ordering (if that matters)

But WET is an official format, not CommonCrawl specific, as indicated by

https://library.stanford.edu/projects/web-archiving/research-resources/data-formats-and-apis
"WET (parsed text)

WARC Encapsulated Text (WET) or parsed text consists of the extracted plaintext, delimited by archived document. Each record retains the associated URL and timestamp. Common Crawl provides details on the format and Internet Archive provides documentation on usage, though they use different names for the format."

So it must be possible to get the WARC to WET conversion used for CommonCrawl data to work on Autistici crawl's WARC files.


RESOLUTION:
---
I made changes to 2 java source files in the 2 github projects ia-web-commons and ia-hadoop-tools, which we use for the WARC to WET processing of CommonCrawl data. These git projects (with modifications for commoncrawl) are already at http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/hdfs-cc-work/gitprojects.

The changed files are as follows:
1. patches/WATExtractorOutput.java
   put into ia-web-commons/src/main/java/org/archive/extract
   after renaming existing to .orig

THEN RECOMPILE ia-web-commons with:
   mvn install

2. patches/GZRangeClient.java
   put into ia-hadoop-tools/src/main/java/org/archive/server
   after renaming existing to .orig

THEN RECOMPILE ia-hadoop-tools with:
   mvn package

Make sure to first compile ia-web-commons, then ia-hadoop-tools.


The modifications made to the above 2 files are as follows:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
1. ia-web-commons/src/main/java/org/archive/extract/WATExtractorOutput.java

[diff src/main/java/org/archive/extract/WATExtractorOutput.orig src/main/java/org/archive/extract/WATExtractorOutput.java]

162,163c162,163
<       targetURI = extractOrIO(md, "Envelope.WARC-Header-Metadata.WARC-Filename");
<     } else {
---
>       targetURI = extractOrIO(md, "Envelope.WARC-Header-Metadata.WARC-Warcinfo-ID");
>     } else {


2. ia-hadoop-tools/src/main/java/org/archive/server/GZRangeClient.java

[diff src/main/java/org/archive/server/GZRangeClient.orig src/main/java/org/archive/server/GZRangeClient.java]

76,83c76,82
<       "WARC/1.0\r\n" +
<       "WARC-Type: warcinfo\r\n" +
<       "WARC-Date: %s\r\n" +
<       "WARC-Filename: %s\r\n" +
<       "WARC-Record-ID: <urn:uuid:%s>\r\n" +
<       "Content-Type: application/warc-fields\r\n" +
<       "Content-Length: %d\r\n\r\n";
<
---
>       "WARC/1.0\r\n" +
>       "Content-Type: application/warc-fields\r\n" +
>       "WARC-Type: warcinfo\r\n" +
>       "WARC-Warcinfo-ID: <urn:uuid:%s>\r\n" +
>       "Content-Length: %d\r\n\r\n" +
>       "WARC-Record-ID: <urn:uuid:%s>\r\n" +
>       "WARC-Date: %s\r\n";
115,119c114,119
<   private static String DEFAULT_WARC_PATTERN = "software: %s Extractor\r\n" +
<       "format: WARC File Format 1.0\r\n" +
<       "conformsTo: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf\r\n" +
<       "publisher: Internet Archive\r\n" +
<       "created: %s\r\n\r\n";
---
>   private static String DEFAULT_WARC_PATTERN = "Software: crawl/1.0\r\n" +
>       "Format: WARC File Format 1.0\r\n" +
>       "Conformsto: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf\r\n\r\n";
>       // +
>       //"publisher: Internet Archive\r\n" +
>       //"created: %s\r\n\r\n";
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<


3. To run WARC to WET, the warc needs to live on hdfs in a "warc" folder, and there should be "wet" and "wat" folders at the same level.

For example, assume that running Autistici's crawl generated $GOPATH/bin/crawl.warc.gz
(the default location and filename, unless you pass flags to the crawl CLI to control these).

a. Ensure you get crawl.warc.gz onto the vagrant VM that has the git projects for WARC-to-WET installed, recompiled with the above modifications.

b. Now create the folder structure needed for the warc-to-wet conversion:
    hdfs dfs -mkdir /user/vagrant/warctest
    hdfs dfs -mkdir /user/vagrant/warctest/warc
    hdfs dfs -mkdir /user/vagrant/warctest/wet
    hdfs dfs -mkdir /user/vagrant/warctest/wat

c. Put crawl.warc.gz into the warc folder on hdfs:
    hdfs dfs -put crawl.warc.gz /user/vagrant/warctest/warc/.

d. Finally, time to run the actual warc-to-wet conversion from ia-hadoop-tools:
    cd ia-hadoop-tools
    WARC_FOLDER=/user/vagrant/warctest/warc
    $HADOOP_MAPRED_HOME/bin/hadoop jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -strictMode -skipExisting batch-id-xyz $WARC_FOLDER/crawl*.warc.gz

This is more meaningful when the WARC_FOLDER contains multiple *.warc.gz files, as the above will use map-reduce to generate a *.warc.wet.gz file in the output wet folder for each input *.warc.gz file.

e. Copy the generated wet files across from /user/vagrant/warctest/wet/:

    (cd /vagrant, or else
     cd /home/vagrant
    )
    hdfs dfs -get /user/vagrant/warctest/wet/crawl.warc.wet.gz .

or, when dealing with multiple input warc files, we'll have multiple wet files:
    hdfs dfs -get /user/vagrant/warctest/wet/*.warc.wet.gz .


f. Now you can view the contents of the WET files to confirm they are what we want:
    gunzip crawl.warc.wet.gz
    zless crawl.warc.wet

The wet file contents should look good now: the web pages as WET records without html tags.


----------------------------------------------------
I. Setting up Nutch v2 on its own Vagrant VM machine
----------------------------------------------------
1. Untar vagrant-for-nutch2.tar.gz
2. Follow the instructions below. A copy is also in vagrant-for-nutch2/GS_README.txt

---
REASONING FOR THE NUTCH v2 SPECIFIC VAGRANT VM:
---
We were able to get nutch v1 working on a regular machine.

From a few pages online, starting with https://stackoverflow.com/questions/33354460/nutch-clone-website, it appeared that "./bin/nutch fetch -all" was the nutch command to mirror a web site. Nutch v2 introduced the -all flag to the "./bin/nutch fetch" command, and nutch v2 requires HBase, which in turn presupposes hadoop.

Our vagrant VM for commoncrawl had an incompatible version of HBase, but that version was needed by that VM's versions of hadoop and spark. So Dr Bainbridge came up with the idea of a separate vagrant VM for Nutch v2, which would have the version of HBase it needed and a matching Hadoop version. Versions compatible with nutch 2.3.1 are mentioned at https://waue0920.wordpress.com/2016/08/25/nutch-2-3-1-hbase-0-98-hadoop-2-5-solr-4-10-3/
(Another option was MongoDB instead of HBase, https://lobster1234.github.io/2017/08/14/search-with-nutch-mongodb-solr/, but that was not even covered in the apache nutch 2 installation guide.)

---
    Vagrant VM for Nutch2
---
This vagrant virtual machine is based on https://github.com/martinprobson/vagrant-hadoop-hive-spark

However:
- It comes with the older versions of hadoop 2.5.2 and hbase 0.98.21, and no spark or hive or other packages.
- The VM is called node2 with IP 10.211.55.102 (instead of node1 with IP 10.211.55.101).
- Since not all packages are installed, fewer ports need forwarding, and they're forwarded to port number + 2 so as not to conflict with any vagrant VM that uses the original vagrant image's forwarded port numbers.
- scripts/common.sh uses an HBASE_ARCHIVE of the form "hbase-${HBASE_VERSION}-hadoop2-bin.tar.gz" (the -hadoop2 suffix is specific to v0.98.21)
- and hbase gets installed as /usr/local/hbase-$HBASE_VERSION-hadoop2 (again with the additional -hadoop2 suffix specific to v0.98.21) in scripts/setup-hbase.sh, so the symbolic link creation there needed to refer to a path of this form.

INSTRUCTIONS:
a. Mostly follow the "Getting Started" instructions at https://github.com/martinprobson/vagrant-hadoop-hive-spark
b. But after step 3, replace the github-cloned Vagrantfile, scripts and resources folders with their modified counterparts included in the tarball that can be downloaded from http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/hdfs-cc-work/vagrant-for-nutch2.tar.gz.
c. Wherever the rest of that git page refers to "node1", IP "10.211.55.101" and specific port numbers, use instead "node2", IP "10.211.55.102" and the forwarded port numbers in the customised Vagrantfile.
If there's already a node2 / IP "10.211.55.102" vagrant VM set up, then adjust all the files in the git-cloned vagrant VM (as already modified by the contents of this vagrant-for-nutch2 folder) as follows:
- increment all occurrences of node2 and "10.211.55.102" to node3 and IP "10.211.55.103", if not already taken, and
- in the Vagrantfile, increment the forwarded ports by another 2 or so from the highest port number values already in use by other vagrant VMs.
d. After doing "vagrant up --provider=virtualbox" to create the VM, do "vagrant ssh" or "vagrant ssh node<#>" (e.g. "vagrant ssh node2") for step 8 in the "Getting Started" section.
e. Inside the VM, install emacs, maven, firefox:

    sudo apt-get install emacs

    sudo apt update
    sudo apt install maven

    sudo apt-get -y install firefox

f. We set up nutch 2.3.1, which can be downloaded from https://archive.apache.org/dist/nutch/2.3.1/, as that version worked as per the nutch2 tutorial instructions with the configuration of specific versions of hadoop, hbase and gora for the vagrant VM described here.

After untarring the nutch 2.3.1 source tarball (a command sketch follows this list),
  1. move apache-nutch-2.3.1/conf/regex-urlfilter.txt to apache-nutch-2.3.1/conf/regex-urlfilter.txt.orig
  2. download the two files nutch-site.xml and regex-urlfilter.GS_TEMPLATE from http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/hdfs-cc-work/conf
     and put them into the apache-nutch-2.3.1/conf folder.
  3. Then continue following the nutch tutorial 2 instructions at https://cwiki.apache.org/confluence/display/NUTCH/Nutch2Tutorial to set up nutch2 (and apache-solr, if needed, but I didn't install apache-solr for nutch v2).
     - nutch-site.xml has already been configured to do as much optimisation and speeding up of the crawling as we know about concerning nutch
     - for each site that will be crawled, regex-urlfilter.GS_TEMPLATE will get copied as the live "regex-urlfilter.txt" file, and lines of regex filters will be appended to its end.
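
The following sketch pulls those steps together. The nutch download URL matches the archive location given above; the two conf-file URLs are an assumption that the raw files sit in the same svn location as the scripts fetched earlier, so adjust them if the repository layout differs:

    cd /home/vagrant
    wget https://archive.apache.org/dist/nutch/2.3.1/apache-nutch-2.3.1-src.tar.gz
    tar xzf apache-nutch-2.3.1-src.tar.gz
    cd apache-nutch-2.3.1/conf
    mv regex-urlfilter.txt regex-urlfilter.txt.orig
    wget http://svn.greenstone.org/gs3-extensions/maori-lang-detection/hdfs-cc-work/conf/nutch-site.xml
    wget http://svn.greenstone.org/gs3-extensions/maori-lang-detection/hdfs-cc-work/conf/regex-urlfilter.GS_TEMPLATE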

------------------------------------------------------------------------
J. Automated crawling with Nutch v2.3.1 and post-processing
------------------------------------------------------------------------
1. When you're ready to start crawling with Nutch 2.3.1:
- copy the batchcrawl.sh file (from http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/hdfs-cc-work/scripts) into the vagrant machine at top level. Make the script executable.
- copy the to_crawl.tar.gz file containing the "to_crawl" folder (generated by running CCWETProcessor.java over the downloaded common-crawl data where MRI was the primary language) into the vagrant machine at top level.
- run batchcrawl.sh on a site or range of sites not yet crawled, e.g.
    ./batchcrawl.sh 00485-00500

2. When crawling is done, the above will have generated the "crawled" folder containing a subfolder for each of the crawled sites, e.g. subfolders 00485 to 00500. Each crawled site's folder will contain a dump.txt with the text output of that site's web pages. The "crawled" folder, with site subfolders each containing a dump.txt file, can then be processed as described in section K.
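
To eyeball one site's output before moving on to section K (a sketch; 00485 is just an example site number):

    ls crawled/
    less crawled/00485/dump.txt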


------------------------------------------------------------------------
K. Sending the crawled data into mongodb with NutchTextDumpProcessor.java
------------------------------------------------------------------------
1. The crawled folder should contain all the batch crawls done with nutch (section J above).

2. Set up the mongodb connection properties in conf/config.properties.
By default, the mongodb database name is configured to be "ateacrawldata".

3. Create a mongodb database with the specified name, i.e. a database named "ateacrawldata" unless you changed the default db name.

4. Set up the environment and compile NutchTextDumpToMongoDB:
    cd maori-lang-detection/apache-opennlp-1.9.1
    export OPENNLP_HOME=`pwd`
    cd maori-lang-detection/src

    javac -cp ".:../conf:../lib/*:$OPENNLP_HOME/lib/opennlp-tools-1.9.1.jar" org/greenstone/atea/NutchTextDumpToMongoDB.java

5. Pass the crawled folder to NutchTextDumpToMongoDB:
    java -cp ".:../conf:../lib/*:$OPENNLP_HOME/lib/opennlp-tools-1.9.1.jar" org/greenstone/atea/NutchTextDumpToMongoDB /PATH/TO/crawled

6. It may take 1.5 hours or so to ingest the approximately 1450 crawled sites' data into mongodb.

7. Launch the Robo 3T MongoDB client (version 1.3 is the one we tested). Use it to connect to MongoDB's "ateacrawldata" database.
Now you can run queries.
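
For example, you can start by listing what the ingest created; this is only a sketch, since the collection names depend on what NutchTextDumpToMongoDB sets up, and it assumes the mongo shell client is installed on the machine where mongodb runs:

    mongo ateacrawldata --eval 'db.getCollectionNames().forEach(function(n) { print(n, db[n].count()); })'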

--------------------------------------------------------
APPENDIX: Reading data from hbase tables and backing up hbase
--------------------------------------------------------

* Backing up HBase database:
https://blogs.msdn.microsoft.com/data_otaku/2016/12/21/working-with-the-hbase-import-and-export-utility/
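
The linked article describes HBase's bundled Export/Import map-reduce utility. As a sketch (the table name and hdfs path below are just examples):

    # export one table to a folder on hdfs
    hbase org.apache.hadoop.hbase.mapreduce.Export '01066_webpage' /user/vagrant/hbase-backup/01066_webpage
    # later, restore it into an existing table with the same column families
    hbase org.apache.hadoop.hbase.mapreduce.Import '01066_webpage' /user/vagrant/hbase-backup/01066_webpage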

* From an image at http://dwgeek.com/read-hbase-table-using-hbase-shell-get-command.html/:
to see the contents of a table, inside the hbase shell, type:

    scan 'tablename'

e.g. type scan '01066_webpage' and hit enter.


To list tables and see their "column families" (I don't yet understand what these are):

hbase shell
hbase(main):001:0> list

hbase(main):002:0> describe '01066_webpage'
Table 01066_webpage is ENABLED
01066_webpage
COLUMN FAMILIES DESCRIPTION
{NAME => 'f', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'h', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'il', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'mk', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'mtdt', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'ol', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'p', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 's', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
8 row(s) in 0.1180 seconds


-----------------------EOF------------------------
