source: gs3-extensions/maori-lang-detection/hdfs-cc-work/GS_README.TXT@ 33545

----------------------------------------
INDEX: follow in sequence
----------------------------------------
A. VAGRANT VM WITH HADOOP AND SPARK
B. Create IAM role on Amazon AWS to use S3a
C. Configure Spark on your vagrant VM with the AWS authentication details
---
The script scripts/setup.sh now automates the steps in D-F below
and prints out the main instruction for G.
---
D. OPTIONAL? Further configuration for Hadoop to work with Amazon AWS
E. Setup cc-index-table git project
F. Setup warc-to-wet tools (git projects)
G. Getting and running our scripts
---
H. Autistici's crawl - CLI to download web sites as WARCs, with basic features to avoid crawler traps
I. Setting up Nutch v2 on its own Vagrant VM

----------------------------------------

----------------------------------------
A. VAGRANT VM WITH HADOOP AND SPARK
----------------------------------------
Set up vagrant with hadoop and spark as follows:

1. Follow the instructions at
https://github.com/martinprobson/vagrant-hadoop-hive-spark

This will eventually create the following folder, which will contain the Vagrantfile:
/home/<USER>/vagrant-hadoop-hive-spark

2. If there are other vagrant VMs set up according to the same instructions on the same machine, then you need to change the forwarded ports (the 2nd column of ports) in the file "Vagrantfile". In the example below, excerpted from my Vagrantfile, I've incremented the forwarded ports by 1:

    config.vm.network "forwarded_port", guest: 8080, host: 8081
    config.vm.network "forwarded_port", guest: 8088, host: 8089
    config.vm.network "forwarded_port", guest: 9083, host: 9084
    config.vm.network "forwarded_port", guest: 4040, host: 4041
    config.vm.network "forwarded_port", guest: 18888, host: 18889
    config.vm.network "forwarded_port", guest: 16010, host: 16011

Remember to visit the adjusted ports on the running VM.

3. The most useful vagrant commands:
    vagrant up        # start up the vagrant VM if not already running.
                      # May need to provide the VM's ID if there's more than one vagrant VM.
    vagrant ssh       # ssh into the sole vagrant VM, else may need to provide the vagrant VM's ID.

    vagrant halt      # shut down the vagrant VM. Provide the VM's ID if there's more than one vagrant VM.

    (vagrant destroy) # get rid of your vagrant VM. Useful if you've edited your Vagrantfile.
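
If you're not sure what a VM's ID is, the standard vagrant commands below list the VMs vagrant knows about (a hedged aside, not part of the original notes):

    vagrant status          # states of the VMs defined by the Vagrantfile in the current folder
    vagrant global-status   # IDs and states of all vagrant VMs on this machine
    vagrant up <ID>         # then pass the listed ID to up/ssh/halt as needed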


4. Inside the VM, /home/<USER>/vagrant-hadoop-hive-spark will be shared and mounted as /vagrant.
Remember, this is the folder containing the Vagrantfile. It's easy to use the shared folder to transfer files between the VM and the actual machine that hosts it.

5. Install EMACS, FIREFOX AND MAVEN on the vagrant VM:
Start up the vagrant machine ("vagrant up") and ssh into it ("vagrant ssh") if you haven't already.


a. sudo apt-get -y install firefox

b. sudo apt-get install emacs

c. sudo apt-get install maven
   (or sudo apt update
       sudo apt install maven)

Maven is needed for the commoncrawl github projects we'll be working with.


6. Although you can edit the Vagrantfile to have emacs and maven installed automatically when the vagrant VM is created, for firefox you're advised to install it manually as above.

To be able to view firefox from the machine hosting the VM, use a separate terminal and run:
    vagrant ssh -- -Y
[or "vagrant ssh -- -Y node1", if the VM ID is node1]

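For reference, a minimal sketch of provisioning emacs and maven automatically (a hedged suggestion, not from the original setup): save something like the following as provision-extras.sh next to the Vagrantfile, and reference it from the Vagrantfile with a line such as config.vm.provision "shell", path: "provision-extras.sh".

    #!/bin/bash
    # provision-extras.sh - hypothetical shell provisioner for the extra packages
    set -e
    apt-get update
    # provisioners run as root, so no sudo is needed here
    apt-get install -y emacs maven
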
READING ON Vagrant:
 * Guide: https://www.vagrantup.com/intro/getting-started/index.html
 * Common cmds: https://blog.ipswitch.com/5-vagrant-commands-you-need-to-know
 * vagrant reload = vagrant halt + vagrant up https://www.vagrantup.com/docs/cli/reload.html
 * https://stackoverflow.com/questions/46903623/how-to-use-firefox-ui-in-vagrant-box
 * https://stackoverflow.com/questions/22651399/how-to-install-firefox-in-precise64-vagrant-box
     sudo apt-get -y install firefox
 * vagrant install emacs: https://medium.com/@AnnaJS15/getting-started-with-virtualbox-and-vagrant-8d98aa271d2a

 * hadoop conf: sudo vi /usr/local/hadoop-2.7.6/etc/hadoop/mapred-site.xml
 * https://data-flair.training/forums/topic/mkdir-cannot-create-directory-data-name-node-is-in-safe-mode/

-------------------------------------------------
B. Create IAM role on Amazon AWS to use S3 (S3a)
-------------------------------------------------
CommonCrawl (CC) crawl data is stored on Amazon S3 and is accessed via the s3a protocol, which has superseded both the original s3 and its successor s3n.

In order to have access to CC crawl data, you need to create an IAM role on Dr Bainbridge's Amazon AWS account and configure its profile for commoncrawl.

1. Log into Dr Bainbridge's Amazon AWS account
- In the aws management console:
[email protected]
lab pwd, capital R and ! (maybe g)


2. Create a new "iam" role or user for the "commoncrawl(er)" profile

3. You can create the commoncrawl profile while creating the user/role, by following the instructions at https://answers.dataiku.com/1734/common-crawl-s3
which state:

"Even though the bucket is public, if your AWS key does not have your full permissions (ie if it's a restricted IAM user), you need to grant explicit access to the commoncrawl bucket: attach the following policy to your IAM user"

#### START POLICY IN JSON FORMAT ###
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1503647467000",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::commoncrawl/*",
        "arn:aws:s3:::commoncrawl"
      ]
    }
  ]
}
#### END POLICY ###
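
If you prefer the command line over the web console, the same user and policy can also be set up with the AWS CLI along these lines (a hedged sketch; the user name and policy name below are just placeholders):

    # assumes the AWS CLI is installed and configured with admin credentials,
    # and that the policy JSON above has been saved as commoncrawl-policy.json
    aws iam create-user --user-name commoncrawler
    aws iam put-user-policy --user-name commoncrawler \
        --policy-name CommonCrawlReadAccess \
        --policy-document file://commoncrawl-policy.json
    # then create an access key pair for use in section C below:
    aws iam create-access-key --user-name commoncrawler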


--------------------------------------------------------------------------
C. Configure Spark on your vagrant VM with the AWS authentication details
--------------------------------------------------------------------------
Any Spark jobs run against the CommonCrawl data stored on Amazon s3a need to be able to authenticate with the AWS IAM role you created above. To do this, put the Amazon AWS access key and secret key in the Spark configuration properties file, rather than in hadoop's core-site.xml: in the latter case, the authentication details don't get copied across to the other computers in the distributed cluster when distributed jobs are run, even though those computers also need to know how to authenticate.

1. Inside the vagrant vm:

    sudo emacs /usr/local/spark-2.3.0-bin-hadoop2.7/conf/spark-defaults.conf
    (sudo emacs $SPARK_HOME/conf/spark-defaults.conf)

2. Edit the spark properties conf file to contain these 3 new properties:

    spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
    spark.hadoop.fs.s3a.access.key=PASTE_IAM-ROLE_ACCESSKEY_HERE
    spark.hadoop.fs.s3a.secret.key=PASTE_IAM-ROLE_SECRETKEY_HERE

Instructions on which properties to set were taken from:
- https://stackoverflow.com/questions/30385981/how-to-access-s3a-files-from-apache-spark
- https://community.cloudera.com/t5/Community-Articles/HDP-2-4-0-and-Spark-1-6-0-connecting-to-AWS-S3-buckets/ta-p/245760

[NOTE, inactive alternative: Instead of editing spark's config file, these properties can also be set in the bash script that executes the commoncrawl Spark jobs:

$SPARK_HOME/bin/spark-submit \
    ...
    --conf spark.hadoop.fs.s3a.access.key=ACCESSKEY \
    --conf spark.hadoop.fs.s3a.secret.key=SECRETKEY \
    --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
    ...

But it's better not to hardcode authentication details into code, so I did it the first way.
]

----------------------------------------------------------------------
NOTE:
The script scripts/setup.sh now automates the steps in D-F below
and prints out the main instruction for G.


----------------------------------------------------------------------
D. OPTIONAL? Further configuration for Hadoop to work with Amazon AWS
----------------------------------------------------------------------
The following 2 pages state that additional steps are necessary to get hadoop and spark to work with AWS S3a:

- https://stackoverflow.com/questions/30385981/how-to-access-s3a-files-from-apache-spark
- https://community.cloudera.com/t5/Community-Articles/HDP-2-4-0-and-Spark-1-6-0-connecting-to-AWS-S3-buckets/ta-p/245760

I'm not sure whether these steps were really necessary in my case, and if so, whether it was A or B below that got things working for me. However, I have both A and B below set up.


A. Check your maven installation for necessary jars:

1. Installing maven may already have fetched the specifically recommended version of the AWS Java SDK (aws-java-sdk-1.7.4.jar) and the hadoop-aws jar matching the vagrant VM's hadoop version 2.7.6 (hadoop-aws-2.7.6.jar). Check these locations, as that's where I have them:
- /home/vagrant/.m2/repository/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar
- /home/vagrant/.m2/repository/org/apache/hadoop/hadoop-aws/2.7.6/hadoop-aws-2.7.6.jar

The specifically recommended v1.7.4 from the instructions can be found off https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk/1.7.4 at https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar

2. The script that runs the 2 Spark jobs uses the above paths for one of the spark jobs:
    $SPARK_HOME/bin/spark-submit \
        --jars file:/home/vagrant/.m2/repository/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar,file:/home/vagrant/.m2/repository/org/apache/hadoop/hadoop-aws/2.7.6/hadoop-aws-2.7.6.jar \
        --driver-class-path=/home/vagrant/.m2/repository/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar:/home/vagrant/.m2/repository/org/apache/hadoop/hadoop-aws/2.7.6/hadoop-aws-2.7.6.jar \

However, the other Spark job in the script does not set --jars or --driver-class-path, despite also referring to the s3a://commoncrawl table. So I'm not sure whether the jars are necessary or whether they were just being ignored when provided.

B. Download jar files and put them on the hadoop classpath:

1. Download the jar files:
- I obtained aws-java-sdk-1.11.616.jar (v1.11) from https://aws.amazon.com/sdk-for-java/

- I downloaded the hadoop-aws 2.7.6 jar, as it goes with my version of hadoop, from https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/2.7.6

2. The easiest solution is to copy the 2 downloaded jars to a location on the hadoop classpath.

a. This command shows the paths present on the Hadoop CLASSPATH:
    hadoop classpath
One of the paths it will list is /usr/local/hadoop-2.7.6/share/hadoop/common/

b. SUDO COPY the 2 downloaded jar files, hadoop-aws-2.7.6.jar and aws-java-sdk-1.11.616.jar, to this location:

    sudo cp hadoop-aws-2.7.6.jar /usr/local/hadoop-2.7.6/share/hadoop/common/.
    sudo cp aws-java-sdk-1.11.616.jar /usr/local/hadoop-2.7.6/share/hadoop/common/.

Any hadoop jobs run will now find these 2 jar files on the classpath.

[NOTE, unused alternative: Instead of copying the 2 jar files into a system location, assuming they were downloaded into /home/vagrant/lib, you can also export a custom folder's jar files into the hadoop classpath from the bash script that runs the spark jobs. This had no effect for me, and was commented out, and is another reason why I'm not sure if the 2 jar files were even necessary.
#export LIBJARS=/home/vagrant/lib/*
#export HADOOP_CLASSPATH=`echo ${LIBJARS} | sed s/,/:/g`
]
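
As a quick sanity check (a hedged suggestion, not part of the original steps), once the jars are on the hadoop classpath you can try listing the public commoncrawl bucket directly, passing the IAM access keys as generic -D options:

    # assumes the hadoop-aws and aws-java-sdk jars are on the hadoop classpath (section D above)
    # and that ACCESSKEY/SECRETKEY are the IAM credentials from section B
    hadoop fs -D fs.s3a.access.key=ACCESSKEY \
              -D fs.s3a.secret.key=SECRETKEY \
              -ls s3a://commoncrawl/crawl-data/ | head

If that lists crawl folders rather than throwing a ClassNotFound or authentication error, the S3a setup is working.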


------------------------------------
E. Setup cc-index-table git project
------------------------------------
You need to be inside the vagrant VM.

1. Since you should have already installed maven, you can check out and compile the cc-index-table git project.

    git clone https://github.com/commoncrawl/cc-index-table.git

2. Modify the top-level pom.xml file used by maven: change the spark version to 2.3.0 and add a dependency for hadoop-aws 2.7.6, as indicated by the following diff:

17c17,18
<     <spark.version>2.4.1</spark.version>
---
>     <!--<spark.version>2.4.1</spark.version>-->
>     <spark.version>2.3.0</spark.version>
135a137,143
>     <dependency>
>       <groupId>org.apache.hadoop</groupId>
>       <artifactId>hadoop-aws</artifactId>
>       <version>2.7.6</version>
>     </dependency>
>

3. Although cc-index-table will compile successfully after the above modifications, it will nevertheless throw an exception when it's eventually run. To fix that, edit the file "cc-index-table/src/main/java/org/commoncrawl/spark/examples/CCIndexWarcExport.java" as follows:

a. Set option(header) to false, since the csv file contains no header row, only data rows.
   Change:
        sqlDF = sparkSession.read().format("csv").option("header", true).option("inferSchema", true)
                .load(csvQueryResult);
   To:
        sqlDF = sparkSession.read().format("csv").option("header", false).option("inferSchema", true)
                .load(csvQueryResult);

b. The 4 column names are inferred as _c0 to _c3, not as url/warc_filename etc.
   Change:
        JavaRDD<Row> rdd = sqlDF.select("url", "warc_filename", "warc_record_offset", "warc_record_length").rdd()
                .toJavaRDD();
   To the default inferred column names:
        JavaRDD<Row> rdd = sqlDF.select("_c0", "_c1", "_c2", "_c3").rdd()
                .toJavaRDD();

// TODO: link svn committed versions of orig and modified CCIndexWarcExport.java here.

4. Now (re)compile cc-index-table with the above modifications:

    cd cc-index-table
    mvn package

-------------------------------
F. Setup warc-to-wet tools
-------------------------------
To convert WARC files to WET (.warc.wet) files, you need to check out, set up and compile a couple more tools. These instructions are derived from those at https://groups.google.com/forum/#!topic/common-crawl/hsb90GHq6to

1. Grab and compile the 2 git projects for converting warc to wet:
    git clone https://github.com/commoncrawl/ia-web-commons
    cd ia-web-commons
    mvn install

    git clone https://github.com/commoncrawl/ia-hadoop-tools
    cd ia-hadoop-tools
    # can't compile this yet


2. Add the following into ia-hadoop-tools/pom.xml, at the top of the <dependencies> element, so that maven finds an appropriate version of the org.json package and its JSONTokener (version number found off https://mvnrepository.com/artifact/org.json/json):

    <dependency>
      <groupId>org.json</groupId>
      <artifactId>json</artifactId>
      <version>20131018</version>
    </dependency>

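If you want to confirm that maven now resolves the org.json dependency (a hedged suggestion, not a required step), something like this should list it:

    cd ia-hadoop-tools
    # prints the resolved dependency tree, filtered to the org.json artifact
    mvn dependency:tree -Dincludes=org.json:json
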
[
  UNFAMILIAR CHANGES that I don't recollect making and that may have been a consequence of the change in step 1 above:
  a. These further differences show up between the original version of the file in pom.xml.orig and the modified new pom.xml:
     ia-hadoop-tools>diff pom.xml.orig pom.xml

     <     <groupId>org.netpreserve.commons</groupId>
     <     <artifactId>webarchive-commons</artifactId>
     <     <version>1.1.1-SNAPSHOT</version>
     ---
     >     <groupId>org.commoncrawl</groupId>
     >     <artifactId>ia-web-commons</artifactId>
     >     <version>1.1.9-SNAPSHOT</version>

  b. I don't recollect changing or manually introducing any java files. I just followed the instructions at https://groups.google.com/forum/#!topic/common-crawl/hsb90GHq6to

     However, running diff -rq between the latest "ia-hadoop-tools" git project (checked out a month after my "ia-hadoop-tools.orig" checkout) and the original shows the following differences, in files that github itself does not show as recently modified in that same period.

     ia-hadoop-tools> diff -rq ia-hadoop-tools ia-hadoop-tools.orig/
     Files ia-hadoop-tools/src/main/java/org/archive/hadoop/io/HDFSTouch.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/io/HDFSTouch.java differ
     Only in ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs: ArchiveFileExtractor.java
     Files ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/CDXGenerator.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs/CDXGenerator.java differ
     Files ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/JobDriver.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs/JobDriver.java differ
     Files ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/WARCMetadataRecordGenerator.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs/WARCMetadataRecordGenerator.java differ
     Files ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/WATGenerator.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs/WATGenerator.java differ
     Only in ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs: WEATGenerator.java
     Files ia-hadoop-tools/src/main/java/org/archive/hadoop/mapreduce/ZipNumPartitioner.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/mapreduce/ZipNumPartitioner.java differ
     Files ia-hadoop-tools/src/main/java/org/archive/hadoop/pig/ZipNumRecordReader.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/pig/ZipNumRecordReader.java differ
     Files ia-hadoop-tools/src/main/java/org/archive/hadoop/streaming/ZipNumRecordReader.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/streaming/ZipNumRecordReader.java differ
     Files ia-hadoop-tools/src/main/java/org/archive/server/FileBackedInputStream.java and ia-hadoop-tools.orig/src/main/java/org/archive/server/FileBackedInputStream.java differ
     Files ia-hadoop-tools/src/main/java/org/archive/server/GZRangeClient.java and ia-hadoop-tools.orig/src/main/java/org/archive/server/GZRangeClient.java differ
     Files ia-hadoop-tools/src/main/java/org/archive/server/GZRangeServer.java and ia-hadoop-tools.orig/src/main/java/org/archive/server/GZRangeServer.java differ
]

3. Now ia-hadoop-tools can be compiled:
    cd ia-hadoop-tools
    mvn package

4. It can't be run until guava.jar is on the hadoop classpath. Locate a guava.jar and put it into an existing location listed by hadoop classpath:

    locate guava.jar
    # found in /usr/share/java/guava.jar and /usr/share/maven/lib/guava.jar
    diff /usr/share/java/guava.jar /usr/share/maven/lib/guava.jar
    # identical/no difference, so can use either
    sudo cp /usr/share/java/guava.jar /usr/local/hadoop/share/hadoop/common/.
    # now guava.jar has been copied into a location on the hadoop classpath


Having done the above, our bash script will now be able to convert WARC to WET files when it runs:
    $HADOOP_MAPRED_HOME/bin/hadoop jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -strictMode -skipExisting batch-id-xyz hdfs:///user/vagrant/PATH/TO/warc/*.warc.gz
Our script expects a specific folder structure: a "warc" folder (containing the warc files), which is supplied as above, plus empty "wet" and "wat" folders at the same level as the "warc" folder.


While the job is running, you can visit the Spark Context at http://node1:4040/jobs/ (http://node1:4041/jobs/ for me the first time, since I forwarded the vagrant VM's ports at +1; however, subsequent times it was on node1:4040/jobs).

-----------------------------------
G. Getting and running our scripts
-----------------------------------

1. Grab our 1st bash script and put it into /home/vagrant/cc-index-table/src/script:
    cd cc-index-table/src/script
    wget http://svn.greenstone.org/gs3-extensions/maori-lang-detection/hdfs-cc-work/scripts/get_maori_WET_records_for_crawl.sh
    chmod u+x get_maori_WET_records_for_crawl.sh

RUN AS:
cd cc-index-table
./src/script/get_maori_WET_records_for_crawl.sh <crawl-timestamp>
    where crawl-timestamp is of the form "CC-MAIN-YYYY-##", for crawls from September 2018 onwards

OUTPUT:
After hours of processing (leave it to run overnight), you should end up with:
    hdfs dfs -ls /user/vagrant/<crawl-timestamp>
In particular, the zipped wet records at hdfs:///user/vagrant/<crawl-timestamp>/wet/
that we want would have been copied into /vagrant/<crawl-timestamp>-wet-files/


The script get_maori_WET_records_for_crawl.sh
- takes a crawl timestamp of the form "CC-MAIN-YYYY-##" from Sep 2018 onwards (before which content_languages were not indexed). The legitimate crawl timestamps are listed in the first column at http://index.commoncrawl.org/
- runs a spark job against CC's AWS bucket over s3a to create a csv table of MRI language records
- runs a spark job to download all the WARC records from CC's AWS that are denoted by the csv file's records into zipped warc files
- converts WARC to WET: locally converts the downloaded warc.gz files into warc.wet.gz (and warc.wat.gz) files

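For example, a concrete run might look like the following (a hedged illustration only; CC-MAIN-2019-35 is just one example crawl id from http://index.commoncrawl.org/, and the exact output will differ):

    cd cc-index-table
    ./src/script/get_maori_WET_records_for_crawl.sh CC-MAIN-2019-35

    # afterwards, inspect the results on hdfs and in the shared folder:
    hdfs dfs -ls /user/vagrant/CC-MAIN-2019-35
    hdfs dfs -ls /user/vagrant/CC-MAIN-2019-35/wet
    ls /vagrant/CC-MAIN-2019-35-wet-files/
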

2. Grab our 2nd bash script and put it into the top level of cc-index-table (/home/vagrant/cc-index-table):

    cd cc-index-table
    wget http://svn.greenstone.org/gs3-extensions/maori-lang-detection/hdfs-cc-work/scripts/get_Maori_WET_records_from_CCSep2018_on.sh
    chmod u+x get_Maori_WET_records_from_CCSep2018_on.sh

RUN FROM cc-index-table DIRECTORY AS:
    (cd cc-index-table)
    ./get_Maori_WET_records_from_CCSep2018_on.sh

This script just runs the 1st script cc-index-table/src/script/get_maori_WET_records_for_crawl.sh (above) to process all listed common-crawls since September 2018.
It runs against each common-crawl in sequence and terminates if any of them fails (a sketch of this structure follows the NOTE below).

NOTE: If needed, update the script with more recent crawl timestamps from http://index.commoncrawl.org/
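
The overall structure of that 2nd script is presumably along these lines (a hedged sketch only, not the actual script; see the wget'ed file itself for the real crawl list and details):

    #!/bin/bash
    # hypothetical sketch of get_Maori_WET_records_from_CCSep2018_on.sh's structure
    crawls="CC-MAIN-2018-39 CC-MAIN-2018-43 CC-MAIN-2018-47"   # ...and so on, per index.commoncrawl.org

    for crawl in $crawls; do
        echo "Processing $crawl"
        ./src/script/get_maori_WET_records_for_crawl.sh "$crawl" || { echo "$crawl failed"; exit 1; }
    done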

OUTPUT:
After days of running, you will end up with:
    hdfs:///user/vagrant/<crawl-timestamp>/wet/
for each crawl-timestamp listed in the script,
which at present would have got copied into
    /vagrant/<crawl-timestamp>-wet-files/

Each of these output wet folders can then be processed in turn by CCWETProcessor.java from http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java

-----------------------------------
H. Autistici crawl
-----------------------------------
Autistici's crawl: a CLI tool to download web sites as WARCs, with basic features to avoid crawler traps.

Of the various software options for site mirroring, Autistici's "crawl" seemed promising:
https://anarc.at/services/archive/web/

- CLI.
- Can download a website quite simply, though flags for additional settings are available.
- Coded to prevent common traps.
- Downloads the website as a WARC file.
- Now I have the WARC to WET process working for the WARC file it produced for the usual test site (Dr Bainbridge's home page).

You need to have Go installed in order to install and run Autistici's crawl.
Not a problem, because I can do it on the remote machine (which also hosts the hdfs) where I have sudo powers.

INSTRUCTIONS

1. Install go 1.11 by following the instructions at https://medium.com/better-programming/install-go-1-11-on-ubuntu-18-04-16-04-lts-8c098c503c5f
2. Create the go environment:
#!/bin/bash
# environment vars for golang
export GOROOT=/usr/local/go
export GOPATH=$HOME/go
export PATH=$GOPATH/bin:$GOROOT/bin:$PATH
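
To have these variables available in every shell (a hedged suggestion, not from the original instructions), add the lines above to ~/.bashrc and check the result:

    source ~/.bashrc
    go version      # should print something like: go version go1.11.x linux/amd64
    echo $GOPATH    # should print /home/vagrant/go (or your $HOME/go)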
3. The https://git.autistici.org/ale/crawl/README.md instructions on installing are not very clear and don't work as is at this stage.

These steps work:

cd $GOPATH
mkdir bin
mkdir src
cd src

4. Trying to go install the crawl URL directly didn't work (see
https://stackoverflow.com/questions/14416275/error-cant-load-package-package-my-prog-found-packages-my-prog-and-main
[https://stackoverflow.com/questions/26694271/go-install-doesnt-create-any-bin-file]),
so instead clone the repository manually into the expected location under $GOPATH/src:

vagrant@node2:~/go/src$
    mkdir -p git.autistici.org/ale
    cd git.autistici.org/ale
    git clone https://git.autistici.org/ale/crawl.git

[Now the install command from README.md can be run:]
    cd $GOPATH/src
    go install git.autistici.org/ale/crawl/cmd/crawl

Now we should have a $GOPATH/bin folder containing the "crawl" binary.

5. Run a crawl:
    cd $GOPATH/bin
    ./crawl https://www.cs.waikato.ac.nz/~davidb/

which downloads the site and puts the warc file into the $GOPATH/bin folder.

More options, including the output folder and a WARC filename pattern (so that the multiple warc files created for a huge site follow the same naming scheme), are described in the instructions in README.md.

6. To view the RAW contents of a WARC file:
https://github.com/ArchiveTeam/grab-site/blob/master/README.md#viewing-the-content-in-your-warc-archives

zless <warc-file-name>

zless is already installed on the vagrant VM.


-----------------------------------------------------------------------------------------------
How to run warc-to-wet conversion on sites downloaded as WARCs by Autistici's "crawl"
-----------------------------------------------------------------------------------------------
ISSUES CONVERTING WARC to WET:
---
WARC files produced by Autistici's crawl are in a somewhat different format from CommonCrawl WARCs:
- missing elements in header
- different header elements
- ordering different (if that matters)

But WET is an official format, not CommonCrawl-specific, as indicated by

https://library.stanford.edu/projects/web-archiving/research-resources/data-formats-and-apis
"WET (parsed text)

WARC Encapsulated Text (WET) or parsed text consists of the extracted plaintext, delimited by archived document. Each record retains the associated URL and timestamp. Common Crawl provides details on the format and Internet Archive provides documentation on usage, though they use different names for the format."

So it must be possible to get the WARC to WET conversion used for CommonCrawl data to work on Autistici crawl's WARC files.


RESOLUTION:
---
I made changes to 2 java source files in the 2 github projects ia-web-commons and ia-hadoop-tools, which we use for the WARC to WET processing of CommonCrawl data. These git projects (with modifications for commoncrawl) are already at http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/hdfs-cc-work/gitprojects.

The changed files are as follows:
1. patches/WATExtractorOutput.java
   put into ia-web-commons/src/main/java/org/archive/extract
   after renaming the existing file to .orig

THEN RECOMPILE ia-web-commons with:
   mvn install

2. patches/GZRangeClient.java
   put into ia-hadoop-tools/src/main/java/org/archive/server
   after renaming the existing file to .orig

THEN RECOMPILE ia-hadoop-tools with:
   mvn package

Make sure to first compile ia-web-commons, then ia-hadoop-tools.
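
Put together, applying the two patches might look like this (a hedged sketch, assuming the patches/ folder from the trac link above sits alongside the two checkouts; adjust the paths to your own layout):

    # 1. ia-web-commons patch (compile this first)
    cd ia-web-commons
    mv src/main/java/org/archive/extract/WATExtractorOutput.java \
       src/main/java/org/archive/extract/WATExtractorOutput.orig
    cp ../patches/WATExtractorOutput.java src/main/java/org/archive/extract/
    mvn install

    # 2. ia-hadoop-tools patch (compile this second)
    cd ../ia-hadoop-tools
    mv src/main/java/org/archive/server/GZRangeClient.java \
       src/main/java/org/archive/server/GZRangeClient.orig
    cp ../patches/GZRangeClient.java src/main/java/org/archive/server/
    mvn package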


The modifications made to the above 2 files are as follows:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
1. ia-web-commons/src/main/java/org/archive/extract/WATExtractorOutput.java

[diff src/main/java/org/archive/extract/WATExtractorOutput.orig src/main/java/org/archive/extract/WATExtractorOutput.java]

162,163c162,163
<       targetURI = extractOrIO(md, "Envelope.WARC-Header-Metadata.WARC-Filename");
<     } else {
---
>       targetURI = extractOrIO(md, "Envelope.WARC-Header-Metadata.WARC-Warcinfo-ID");
>     } else {


2. ia-hadoop-tools/src/main/java/org/archive/server/GZRangeClient.java

[diff src/main/java/org/archive/server/GZRangeClient.orig src/main/java/org/archive/server/GZRangeClient.java]

76,83c76,82
<       "WARC/1.0\r\n" +
<       "WARC-Type: warcinfo\r\n" +
<       "WARC-Date: %s\r\n" +
<       "WARC-Filename: %s\r\n" +
<       "WARC-Record-ID: <urn:uuid:%s>\r\n" +
<       "Content-Type: application/warc-fields\r\n" +
<       "Content-Length: %d\r\n\r\n";
<
---
>       "WARC/1.0\r\n" +
>       "Content-Type: application/warc-fields\r\n" +
>       "WARC-Type: warcinfo\r\n" +
>       "WARC-Warcinfo-ID: <urn:uuid:%s>\r\n" +
>       "Content-Length: %d\r\n\r\n" +
>       "WARC-Record-ID: <urn:uuid:%s>\r\n" +
>       "WARC-Date: %s\r\n";
115,119c114,119
<   private static String DEFAULT_WARC_PATTERN = "software: %s Extractor\r\n" +
<       "format: WARC File Format 1.0\r\n" +
<       "conformsTo: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf\r\n" +
<       "publisher: Internet Archive\r\n" +
<       "created: %s\r\n\r\n";
---
>   private static String DEFAULT_WARC_PATTERN = "Software: crawl/1.0\r\n" +
>       "Format: WARC File Format 1.0\r\n" +
>       "Conformsto: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf\r\n\r\n";
>       // +
>       //"publisher: Internet Archive\r\n" +
>       //"created: %s\r\n\r\n";
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<


3. To run WARC to WET, the warc needs to live on hdfs in a warc folder, and there should be wet and wat folders at the same level.

For example, assume that running Autistici's crawl generated $GOPATH/bin/crawl.warc.gz
(the default location and filename unless you pass flags to the crawl CLI to control these).

a. Ensure you get crawl.warc.gz onto the vagrant VM that has the git projects for WARC-to-WET installed, recompiled with the above modifications.

b. Now, create the folder structure needed for warc-to-wet conversion:
    hdfs dfs -mkdir /user/vagrant/warctest
    hdfs dfs -mkdir /user/vagrant/warctest/warc
    hdfs dfs -mkdir /user/vagrant/warctest/wet
    hdfs dfs -mkdir /user/vagrant/warctest/wat

c. Put crawl.warc.gz into the warc folder on hdfs:
    hdfs dfs -put crawl.warc.gz /user/vagrant/warctest/warc/.

d. Finally, time to run the actual warc-to-wet conversion from ia-hadoop-tools:
    cd ia-hadoop-tools
    WARC_FOLDER=/user/vagrant/warctest/warc
    $HADOOP_MAPRED_HOME/bin/hadoop jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -strictMode -skipExisting batch-id-xyz $WARC_FOLDER/crawl*.warc.gz

This is more meaningful when the WARC_FOLDER contains multiple *.warc.gz files,
as the above will use map-reduce to generate a *.warc.wet.gz file in the output wet folder for each input *.warc.gz file.

e. Copy the generated wet files across from /user/vagrant/warctest/wet/:

    (cd /vagrant, or else
     cd /home/vagrant)
    hdfs dfs -get /user/vagrant/warctest/wet/crawl.warc.wet.gz .

or, when dealing with multiple input warc files, we'll have multiple wet files:
    hdfs dfs -get /user/vagrant/warctest/wet/*.warc.wet.gz .


f. Now you can view the contents of the WET files to confirm they are what we want:
    gunzip crawl.warc.wet.gz
    zless crawl.warc.wet

The wet file contents should look good now: the web pages as WET records without html tags.
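
As an additional check (a hedged suggestion), you can count the extracted-text records in a still-gzipped WET file; WET conversion records carry a "WARC-Type: conversion" header:

    # count the conversion records in a (still gzipped) WET file
    zcat crawl.warc.wet.gz | grep -c 'WARC-Type: conversion'
    # or peek at which URLs were extracted:
    zcat crawl.warc.wet.gz | grep 'WARC-Target-URI:' | head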


----------------------------------------------------
I. Setting up Nutch v2 on its own Vagrant VM
----------------------------------------------------
1. Untar vagrant-for-nutch2.tar.gz
2. Follow the instructions in vagrant-for-nutch2/GS_README.txt

---
REASONING FOR THE NUTCH v2 SPECIFIC VAGRANT VM:
---
We were able to get nutch v1 working on a regular machine.

From a few pages online starting with https://stackoverflow.com/questions/33354460/nutch-clone-website, it appeared that "./bin/nutch fetch -all" was the nutch command to mirror a web site. Nutch v2 introduced the -all flag to the ./bin/nutch fetch command, and nutch v2 requires HBase, which presupposes hadoop.

Our commoncrawl vagrant VM had a version of HBase that was incompatible with nutch v2, but that HBase version was needed for the VM's versions of hadoop and spark. So Dr Bainbridge came up with the idea of having a separate Vagrant VM for Nutch v2, which would have the version of HBase it needed and a matching Hadoop version. Versions compatible with nutch 2.3.1 are mentioned at https://waue0920.wordpress.com/2016/08/25/nutch-2-3-1-hbase-0-98-hadoop-2-5-solr-4-10-3/
(Another option was MongoDB instead of HBase, https://lobster1234.github.io/2017/08/14/search-with-nutch-mongodb-solr/, but that was not even covered in the apache nutch 2 installation guide.)



-----------------------EOF------------------------
