GS_README.TXT@ 33541

Last change on this file since 33541 was 33541, checked in by ak19, 5 years ago

1. hdfs-cc-work/GS_README.txt now contains the complete instructions to use Autistici crawl to download a website (as WARC file), as well as the instructions to convert those WARCs to WET. 2. Moved the first part out of MoreReading/crawling-Nutch.txt. 3. Adding patched WARC-to-WET files for the gitprojects ia-web-commons and ia-hadoop-tools to successfully do the WARC-to-WET processing on WARC files generated by Autistici crawl. (Worked on Dr Bainbridge's home page site as a test. Not tried any other site yet, as I wanted to get the work flow from crawl to WET working.)

File size: 29.1 KB

1	----------------------------------------
2	INDEX: follow in sequence
3	----------------------------------------
4	A. VAGRANT VM WITH HADOOP AND SPARK
5	B. Create IAM role on Amazon AWS to use S3a
6	C. Configure Spark on your vagrant VM with the AWS authentication details
7	---
8	Script scripts/setup.sh now is automated to do the steps in D-F below
9	and prints out the main instruction for G.
10	---
11	D. OPTIONAL? Further configuration for Hadoop to work with Amazon AWS
12	E. Setup cc-index-table git project
13	F. Setup warc-to-wet tools (git projects)
14	G. Getting and running our scripts
15	---
16	H. Autistici's crawl - CLI to download web sites as WARCs, features basics to avoid crawler traps
17
18	----------------------------------------
19
20	----------------------------------------
21	A. VAGRANT VM WITH HADOOP AND SPARK
22	----------------------------------------
23	Set up vagrant with hadoop and spark as follows
24
25	1. by following the instructions at
26	https://github.com/martinprobson/vagrant-hadoop-hive-spark
27
28	This will eventually create the following folder, which will contain Vagrantfile
29	/home/<USER>/vagrant-hadoop-hive-spark
30
31	2. If other vagrant VMs have been set up on the same machine following the same instructions, you need to change the forwarded ports (the 2nd column of ports, i.e. the host ports) in the file "Vagrantfile". In the example below, excerpted from my Vagrantfile, I've incremented the forwarded ports by 1:
32
33	config.vm.network "forwarded_port", guest: 8080, host: 8081
34	config.vm.network "forwarded_port", guest: 8088, host: 8089
35	config.vm.network "forwarded_port", guest: 9083, host: 9084
36	config.vm.network "forwarded_port", guest: 4040, host: 4041
37	config.vm.network "forwarded_port", guest: 18888, host: 18889
38	config.vm.network "forwarded_port", guest: 16010, host: 16011
39
40	Remember to visit the adjusted ports on the running VM.
41
42	3. The most useful vagrant commands:
43	vagrant up # start up the vagrant VM if not already running.
44	# May need to provide VM's ID if there's more than one vagrant VM
45	vagrant ssh # ssh into the sole vagrant VM, else may need to provide vagrant VM's ID
46
47	vagrant halt # to shutdown the vagrant VM. Provide VM's ID if there's more than one vagrant VM.
48
49	(vagrant destroy) # to get rid of your vagrant VM. Useful if you've edited your Vagrantfile
50
51
52	4. Inside the VM, /home/<USER>/vagrant-hadoop-hive-spark will be shared and mounted as /vagrant
53	Remember, this is the folder containing Vagrantfile. It's easy to use the shared folder to transfer files between the VM and the actual machine that hosts it.
54
55	5. Install EMACS, FIREFOX AND MAVEN on the vagrant VM:
56	Start up vagrant machine ("vagrant up") and ssh into it ("vagrant ssh") if you haven't already.
57
58
59	a. sudo apt-get -y install firefox
60
61	b. sudo apt-get install emacs
62
63	c. sudo apt-get install maven
64	(or sudo apt update
65	sudo apt install maven)
66
67	Maven is needed for the commoncrawl github projects we'll be working with.
68
69
70	6. Although you can edit the Vagrantfile to have emacs and maven automatically installed when the vagrant VM is created, for firefox, you're advised to install it as above.
71
72	To be able to view firefox from the machine hosting the VM, use a separate terminal and run:
73	vagrant ssh -- -Y
74	[or "vagrant ssh -- -Y node1", if VM ID is node1]
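The forwarded-port adjustment in step 2 can be generated instead of edited by hand. The following is a minimal sketch, assuming the six guest ports from the Vagrantfile excerpt above; the helper name print_forwarded_ports is made up for illustration, and it only prints lines without touching any file.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the vagrant project): print the
# forwarded_port lines for a Vagrantfile, shifting each host port by OFFSET.
print_forwarded_ports() {
    local offset="${1:-1}"
    local guest
    # Guest ports taken from the Vagrantfile excerpt in this README.
    for guest in 8080 8088 9083 4040 18888 16010; do
        printf 'config.vm.network "forwarded_port", guest: %d, host: %d\n' \
            "$guest" "$((guest + offset))"
    done
}
```

Calling "print_forwarded_ports 1" reproduces the excerpt in step 2; an offset of 2 would suit a third VM on the same host.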
75
76	READING ON Vagrant:
77	* Guide: https://www.vagrantup.com/intro/getting-started/index.html
78	* Common cmds: https://blog.ipswitch.com/5-vagrant-commands-you-need-to-know
79	* vagrant reload = vagrant halt + vagrant up https://www.vagrantup.com/docs/cli/reload.html
80	* https://stackoverflow.com/questions/46903623/how-to-use-firefox-ui-in-vagrant-box
81	* https://stackoverflow.com/questions/22651399/how-to-install-firefox-in-precise64-vagrant-box
82	sudo apt-get -y install firefox
83	* vagrant install emacs: https://medium.com/@AnnaJS15/getting-started-with-virtualbox-and-vagrant-8d98aa271d2a
84
85	* hadoop conf: sudo vi /usr/local/hadoop-2.7.6/etc/hadoop/mapred-site.xml
86	* https://data-flair.training/forums/topic/mkdir-cannot-create-directory-data-name-node-is-in-safe-mode/
87
88	-------------------------------------------------
89	B. Create IAM role on Amazon AWS to use S3 (S3a)
90	-------------------------------------------------
91	CommonCrawl (CC) crawl data is stored on Amazon S3. Hadoop and Spark read it through the s3a filesystem connector, which has superseded both the original s3 connector and its earlier successor s3n.
92
93	In order to have access to cc crawl data, need to create an IAM role on Dr Bainbridge's Amazon AWS account and configure its profile for commoncrawl.
94
95	1. Log into Dr Bainbridge's Amazon AWS account
96	- In the aws management console:
97	[email protected]
98	lab pwd, capital R and ! (maybe g)
99
100
101	2. Create a new "iam" role or user for "commoncrawl(er)" profile
102
103	3. You can create the commoncrawl profile while creating the user/role, by following the instructions at https://answers.dataiku.com/1734/common-crawl-s3
104	which states
105
106	"Even though the bucket is public, if your AWS key does not have your full permissions (ie if it's a restricted IAM user), you need to grant explicit access to the commoncrawl bucket: attach the following policy to your IAM user"
107
108	#### START POLICY IN JSON FORMAT ###
109	{
110	"Version": "2012-10-17",
111	"Statement": [
112	{
113	"Sid": "Stmt1503647467000",
114	"Effect": "Allow",
115	"Action": [
116	"s3:GetObject",
117	"s3:ListBucket"
118	],
119	"Resource": [
120	"arn:aws:s3:::commoncrawl/*",
121	"arn:aws:s3:::commoncrawl"
122	]
123	}
124	]
125	}
126	#### END POLICY ###
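If you prefer to keep the policy in a file rather than pasting it into the console, a small sketch such as the following writes the same JSON to disk. The helper name and the default filename are assumptions made for illustration; the AWS CLI call in the trailing comment is an optional follow-up with placeholder user and policy names, and is not something this README's workflow relies on.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write the commoncrawl read-only policy shown above
# to the file named by $1 (default: commoncrawl-policy.json).
write_commoncrawl_policy() {
    local out="${1:-commoncrawl-policy.json}"
    cat > "$out" <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1503647467000",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::commoncrawl/*",
        "arn:aws:s3:::commoncrawl"
      ]
    }
  ]
}
EOF
}

# Assumed follow-up, only if the AWS CLI is installed and configured
# (the user and policy names are placeholders, not from this README):
#   aws iam put-user-policy --user-name commoncrawl \
#       --policy-name commoncrawl-read \
#       --policy-document file://commoncrawl-policy.json
```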
127
128
129	--------------------------------------------------------------------------
130	C. Configure Spark on your vagrant VM with the AWS authentication details
131	--------------------------------------------------------------------------
132	Any Spark jobs run against the CommonCrawl data stored on Amazon S3 and read over s3a need to be able to authenticate with the AWS IAM role you created above. To do this, put the Amazon AWS access key and secret key in the SPARK configuration properties file. (Do this rather than configuring these values in hadoop's core-site.xml: in the latter case, the authentication details don't get copied across when distributed jobs are run on the other computers in the cluster, which also need to know how to authenticate.)
133
134	1. Inside the vagrant vm:
135
136	sudo emacs /usr/local/spark-2.3.0-bin-hadoop2.7/conf/spark-defaults.conf
137	(sudo emacs $SPARK_HOME/conf/spark-defaults.conf)
138
139	2. Edit the spark properties conf file to contain these 3 new properties:
140
141	spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
142	spark.hadoop.fs.s3a.access.key=PASTE_IAM-ROLE_ACCESSKEY_HERE
143	spark.hadoop.fs.s3a.secret.key=PASTE_IAM-ROLE_SECRETKEY_HERE
144
145	Instructions on which properties to set were taken from:
146	- https://stackoverflow.com/questions/30385981/how-to-access-s3a-files-from-apache-spark
147	- https://community.cloudera.com/t5/Community-Articles/HDP-2-4-0-and-Spark-1-6-0-connecting-to-AWS-S3-buckets/ta-p/245760
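Step 2 can also be scripted so that the keys are never typed into an editor. This is a rough sketch under two assumptions that are not from this README: the helper name add_s3a_props, and that the keys are supplied in the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append the three s3a properties to a Spark
# properties file, taking the keys from the environment variables
# AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY (an assumed convention).
add_s3a_props() {
    local conf="$1"
    if [ -z "${AWS_ACCESS_KEY_ID:-}" ] || [ -z "${AWS_SECRET_ACCESS_KEY:-}" ]; then
        echo "Set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY first" >&2
        return 1
    fi
    # Do nothing if the file already sets an s3a access key.
    if [ -f "$conf" ] && grep -q '^spark\.hadoop\.fs\.s3a\.access\.key' "$conf"; then
        echo "s3a properties already present in $conf" >&2
        return 0
    fi
    {
        echo "spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem"
        echo "spark.hadoop.fs.s3a.access.key=${AWS_ACCESS_KEY_ID}"
        echo "spark.hadoop.fs.s3a.secret.key=${AWS_SECRET_ACCESS_KEY}"
    } >> "$conf"
}
```

Since spark-defaults.conf is edited with sudo in step 1, one option is to run the helper on a copy of the file and then sudo cp the copy back into place.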
148
149	[NOTE, inactive alternative: Instead of editing spark's config file to set these properties, these properties can also be set in the bash script that executes the commoncrawl Spark jobs:
150
151	$SPARK_HOME/bin/spark-submit \
152	...
153	--conf spark.hadoop.fs.s3a.access.key=ACCESSKEY \
154	--conf spark.hadoop.fs.s3a.secret.key=SECRETKEY \
155	--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
156	...
157
158	But better not to hardcode authentication details into code, so I did it the first way.
159	]
160
161	----------------------------------------------------------------------
162	NOTE:
163	Script scripts/setup.sh now is automated to do the steps in D-F below
164	and prints out the main instruction for G.
165
166
167	----------------------------------------------------------------------
168	D. OPTIONAL? Further configuration for Hadoop to work with Amazon AWS
169	----------------------------------------------------------------------
170	The following 2 pages state that additional steps are necessary to get hadoop and spark to work with AWS S3a:
171
172	- https://stackoverflow.com/questions/30385981/how-to-access-s3a-files-from-apache-spark
173	- https://community.cloudera.com/t5/Community-Articles/HDP-2-4-0-and-Spark-1-6-0-connecting-to-AWS-S3-buckets/ta-p/245760
174
175	I'm not sure whether these steps were really necessary in my case, and if so, whether it was A or B below that got things working for me. However, I have both A and B below set up.
176
177
178	A. Check your maven installation for necessary jars:
179
180	1. Installing maven may already have got the specifically recommended version of AWS-Java-SDK (aws-java-sdk-1.7.4.jar) and v2.7.6 hadoop-aws matching the vagrant VM's hadoop version (hadoop-aws-2.7.6.jar). Check these locations, as that's where I have them:
181	- /home/vagrant/.m2/repository/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar
182	- /home/vagrant/.m2/repository/org/apache/hadoop/hadoop-aws/2.7.6/hadoop-aws-2.7.6.jar
183
184	The specifically recommended v.1.7.4 from the instructions can be found off https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk/1.7.4 at https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar
185
186	2. The script that runs the 2 Spark jobs uses the above paths for one of the spark jobs:
187	$SPARK_HOME/bin/spark-submit \
188	--jars file:/home/vagrant/.m2/repository/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar,file:/home/vagrant/.m2/repository/org/apache/hadoop/hadoop-aws/2.7.6/hadoop-aws-2.7.6.jar \
189	--driver-class-path=/home/vagrant/.m2/repository/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar:/home/vagrant/.m2/repository/org/apache/hadoop/hadoop-aws/2.7.6/hadoop-aws-2.7.6.jar \
190
191	However the other Spark job in the script does not set --jars or --driver-class-path, despite also referring to the s3a://commoncrawl table. So I'm not sure whether the jars are necessary or whether they were just being ignored when provided.
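To check whether maven already fetched the two jars named in step 1, a small existence check is enough. The helper below is only a sketch (its name check_jars is made up); it reports each path as found or missing and does not download anything.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which of the given jar paths are missing.
# Returns 0 if all exist, 1 otherwise.
check_jars() {
    local jar missing=0
    for jar in "$@"; do
        if [ -f "$jar" ]; then
            echo "found:   $jar"
        else
            echo "MISSING: $jar"
            missing=1
        fi
    done
    return "$missing"
}
```

On the vagrant VM it could be called with the two paths listed in step 1, for example
check_jars /home/vagrant/.m2/repository/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar /home/vagrant/.m2/repository/org/apache/hadoop/hadoop-aws/2.7.6/hadoop-aws-2.7.6.jar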
192
193	B. Download jar files and put them on the hadoop classpath:
194
195	1. download the jar files:
196	- I obtained aws-java-sdk-1.11.616.jar (v1.11) from https://aws.amazon.com/sdk-for-java/
197
198	- I downloaded hadoop-aws 2.7.6 jar, as it goes with my version of hadoop, from https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/2.7.6
199
200	2. The easiest solution is to copy the 2 downloaded jars onto a location in the hadoop classpath.
201
202	a. The command that shows the paths present on the Hadoop CLASSPATH:
203	hadoop classpath
204	One of the paths this will list is /usr/local/hadoop-2.7.6/share/hadoop/common/
205
206	b. SUDO COPY the 2 downloaded jar files, hadoop-aws-2.7.6.jar and aws-java-sdk-1.11.616.jar, to this location:
207
208	sudo cp hadoop-aws-2.7.6.jar /usr/local/hadoop-2.7.6/share/hadoop/common/.
209	sudo cp aws-java-sdk-1.11.616.jar /usr/local/hadoop-2.7.6/share/hadoop/common/.
210
211	Any hadoop jobs run will now find these 2 jar files on the classpath.
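Before copying jars, it may be worth confirming that the target folder really is on the Hadoop classpath. The sketch below takes the classpath as a plain string (for instance the output of "hadoop classpath") and checks for a folder; the helper name is hypothetical. It treats an entry ending in "/*" the same as the bare folder, which is a simplification.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if directory DIR appears as an entry of the
# colon-separated classpath string CP. A trailing "/*" or "/" on an entry,
# and a trailing "/" on DIR, are ignored.
classpath_has_dir() {
    local rest="$1:" dir="${2%/}" entry
    while [ -n "$rest" ]; do
        entry="${rest%%:*}"     # first entry up to the next colon
        rest="${rest#*:}"       # remainder after that colon
        entry="${entry%/\*}"
        entry="${entry%/}"
        [ "$entry" = "$dir" ] && return 0
    done
    return 1
}
```

Usage on the VM might look like:
classpath_has_dir "$(hadoop classpath)" /usr/local/hadoop-2.7.6/share/hadoop/common/ && echo "on classpath"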
212
213	[NOTE, unused alternative: Instead of copying the 2 jar files into a system location, assuming they were downloaded into /home/vagrant/lib, you can also export a custom folder's jar files into the hadoop classpath from the bash script that runs the spark jobs. This had no effect for me, and was commented out, and is another reason why I'm not sure if the 2 jar files were even necessary.
214	#export LIBJARS=/home/vagrant/lib/*
215	#export HADOOP_CLASSPATH=`echo ${LIBJARS} | sed s/,/:/g`
216	]
217
218
219	------------------------------------
220	E. Setup cc-index-table git project
221	------------------------------------
222	Need to be inside the vagrant VM.
223
224	1. Since you should have already installed maven, you can checkout and compile the cc-index-table git project.
225
226	git clone https://github.com/commoncrawl/cc-index-table.git
227
228	2. Modify the toplevel pom.xml file used by maven by changing the spark version used to 2.3.0 and adding a dependency for hadoop-aws 2.7.6, as indicated below:
229
230	17c17,18
231	< <spark.version>2.4.1</spark.version>
232	---
233	> <!--<spark.version>2.4.1</spark.version>-->
234	> <spark.version>2.3.0</spark.version>
235	135a137,143
236	> <dependency>
237	> <groupId>org.apache.hadoop</groupId>
238	> <artifactId>hadoop-aws</artifactId>
239	> <version>2.7.6</version>
240	> </dependency>
241	>
242
243	3. Although cc-index-table will compile successfully after the above modifications, it will nevertheless throw an exception when it's eventually run. To fix that, edit the file "cc-index-table/src/main/java/org/commoncrawl/spark/examples/CCIndexWarcExport.java" as follows:
244
245	a. Set option(header) to false, since the csv file contains no header row, only data rows.
246	Change:
247	sqlDF = sparkSession.read().format("csv").option("header", true).option("inferSchema", true)
248	.load(csvQueryResult);
249	To
250	sqlDF = sparkSession.read().format("csv").option("header", false).option("inferSchema", true)
251	.load(csvQueryResult);
252
253	b. The 4 column names are inferred as _c0 to _c3, not as url/warc_filename etc.
254	Comment out:
255	//JavaRDD<Row> rdd = sqlDF.select("url", "warc_filename", "warc_record_offset", "warc_record_length").rdd()
256	//.toJavaRDD();
257	Replace with the default inferred column names:
258	JavaRDD<Row> rdd = sqlDF.select("_c0", "_c1", "_c2", "_c3").rdd()
259	.toJavaRDD();
260
261	// TODO: link svn committed versions of orig and modified CCIndexWarcExport.java here.
262
263	4. Now (re)compile cc-index-table with the above modifications:
264
265	cd cc-index-table
266	mvn package
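To confirm that the pom.xml edit from step 2 is the one maven will see, the active spark.version can be read back out of the file. This is a sketch only: the helper name is invented, and it assumes the property sits on a single line and that the old value was commented out on its own line, as in the diff above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the active (not commented-out) spark.version
# from a pom.xml. Lines containing an XML comment opener are skipped.
pom_spark_version() {
    grep -v '<!--' "$1" \
        | sed -n 's:.*<spark\.version>\(.*\)</spark\.version>.*:\1:p' \
        | head -n 1
}
```

Run as "pom_spark_version pom.xml" from inside cc-index-table; after the change in step 2 it should print the version you set there.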
267
268	-------------------------------
269	F. Setup warc-to-wet tools
270	-------------------------------
271	To convert WARC files to WET (.warc.wet) files, need to checkout, set up and compile a couple more tools. These instructions are derived from those at https://groups.google.com/forum/#!topic/common-crawl/hsb90GHq6to
272
273	1. Grab and compile the 2 git projects for converting warc to wet:
274	git clone https://github.com/commoncrawl/ia-web-commons
275	cd ia-web-commons
276	mvn install
277
278	git clone https://github.com/commoncrawl/ia-hadoop-tools
279	cd ia-hadoop-tools
280	# can't compile this yet
281
282
283	2. Add the following into ia-hadoop-tools/pom.xml, in the top of the <dependencies> element, so that maven finds an appropriate version of the org.json package and its JSONTokener (version number found off https://mvnrepository.com/artifact/org.json/json):
284
285	<dependency>
286	<groupId>org.json</groupId>
287	<artifactId>json</artifactId>
288	<version>20131018</version>
289	</dependency>
290
291	[
292	UNFAMILIAR CHANGES that I don't recollect making and that may have been a consequence of the change in step 1 above:
293	a. These further differences show up between the original version of the file in pom.xml.orig and the modified new pom.xml:
294	ia-hadoop-tools>diff pom.xml.orig pom.xml
295
296	< <groupId>org.netpreserve.commons</groupId>
297	< <artifactId>webarchive-commons</artifactId>
298	< <version>1.1.1-SNAPSHOT</version>
299	---
300	> <groupId>org.commoncrawl</groupId>
301	> <artifactId>ia-web-commons</artifactId>
302	> <version>1.1.9-SNAPSHOT</version>
303
304	b. I don't recollect changing or manually introducing any java files. I just followed the instructions at https://groups.google.com/forum/#!topic/common-crawl/hsb90GHq6to
305
306	However, a diff -rq between the latest "ia-hadoop-tools" gitproject checked out a month after the "ia-hadoop-tools.orig" checkout I ran, shows the following differences in files which are not shown as recently modified in github itself in that same period.
307
308	ia-hadoop-tools> diff -rq ia-hadoop-tools ia-hadoop-tools.orig/
309	Files ia-hadoop-tools/src/main/java/org/archive/hadoop/io/HDFSTouch.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/io/HDFSTouch.java differ
310	Only in ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs: ArchiveFileExtractor.java
311	Files ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/CDXGenerator.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs/CDXGenerator.java differ
312	Files ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/JobDriver.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs/JobDriver.java differ
313	Files ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/WARCMetadataRecordGenerator.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs/WARCMetadataRecordGenerator.java differ
314	Files ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/WATGenerator.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs/WATGenerator.java differ
315	Only in ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs: WEATGenerator.java
316	Files ia-hadoop-tools/src/main/java/org/archive/hadoop/mapreduce/ZipNumPartitioner.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/mapreduce/ZipNumPartitioner.java differ
317	Files ia-hadoop-tools/src/main/java/org/archive/hadoop/pig/ZipNumRecordReader.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/pig/ZipNumRecordReader.java differ
318	Files ia-hadoop-tools/src/main/java/org/archive/hadoop/streaming/ZipNumRecordReader.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/streaming/ZipNumRecordReader.java differ
319	Files ia-hadoop-tools/src/main/java/org/archive/server/FileBackedInputStream.java and ia-hadoop-tools.orig/src/main/java/org/archive/server/FileBackedInputStream.java differ
320	Files ia-hadoop-tools/src/main/java/org/archive/server/GZRangeClient.java and ia-hadoop-tools.orig/src/main/java/org/archive/server/GZRangeClient.java differ
321	Files ia-hadoop-tools/src/main/java/org/archive/server/GZRangeServer.java and ia-hadoop-tools.orig/src/main/java/org/archive/server/GZRangeServer.java differ
322	]
323
324	3. Now can compile ia-hadoop-tools:
325	cd ia-hadoop-tools
326	mvn package
327
328	4. Can't run it until guava.jar is on hadoop classpath. Locate a guava.jar and put it into an existing location checked for by hadoop classpath:
329
330	locate guava.jar
331	# found in /usr/share/java/guava.jar and /usr/share/maven/lib/guava.jar
332	diff /usr/share/java/guava.jar /usr/share/maven/lib/guava.jar
333	# identical/no difference, so can use either
334	sudo cp /usr/share/java/guava.jar /usr/local/hadoop/share/hadoop/common/.
335	# now guava.jar has been copied into a location on hadoop classpath
336
337
338	Having done the above, our bash script will now be able to convert WARC to WET files when it runs:
339	$HADOOP_MAPRED_HOME/bin/hadoop jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -strictMode -skipExisting batch-id-xyz hdfs:///user/vagrant/PATH/TO/warc/*.warc.gz
340	Our script expects a specific folder structure: there should be a "warc" folder (containing the warc files), which is supplied as above, but also an empty "wet" and "wat" folder at the same level as the "warc" folder.
341
342
343	When the job is running, can visit the Spark Context at http://node1:4040/jobs/ (http://node1:4041/jobs/ for me first time, since I forwarded the vagrant VM's ports at +1. However, subsequent times it was on node1:4040/jobs?)
344
345	-----------------------------------
346	G. Getting and running our scripts
347	-----------------------------------
348
349	1. Grab our 1st bash script and put it into /home/vagrant/cc-index-table/src/script:
350	cd cc-index-table/src/script
351	wget http://svn.greenstone.org/gs3-extensions/maori-lang-detection/hdfs-cc-work/scripts/get_maori_WET_records_for_crawl.sh
352	chmod u+x get_maori_WET_records_for_crawl.sh
353
354	RUN AS:
355	cd cc-index-table
356	./src/script/get_maori_WET_records_for_crawl.sh <crawl-timestamp>
357	where crawl-timestamp is of the form "CC-MAIN-YYYY-##", from September 2018 onwards
358
359	OUTPUT:
360	After hours of processing (leave it to run overnight), you should end up with:
361	hdfs dfs -ls /user/vagrant/<crawl-timestamp>
362	In particular, the zipped wet records at hdfs:///user/vagrant/<crawl-timestamp>/wet/
363	that we want would have been copied into /vagrant/<crawl-timestamp>-wet-files/
364
365
366	The script get_maori_WET_records_for_crawl.sh
367	- takes a crawl timestamp of the form "CC-MAIN-YYYY-##" from Sep 2018 onwards (before which content_languages were not indexed). The legitimate crawl timestamps are listed in the first column at http://index.commoncrawl.org/
368	- runs a spark job against CC's AWS bucket over s3a to create a csv table of MRI language records
369	- runs a spark job to download all the WARC records from CC's AWS that are denoted by the csv file's records into zipped warc files
370	- converts WARC to WET: locally converts the downloaded warc.gz files into warc.wet.gz (and warc.wat.gz) files
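Because a mistyped crawl timestamp only shows up after the job has started, a quick shape check before launching can save time. The following is a sketch with a made-up helper name; it checks the "CC-MAIN-YYYY-##" form only and makes no claim about whether the crawl exists or is recent enough.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if $1 has the form CC-MAIN-YYYY-## .
# Shape check only: it does not verify that the crawl exists or that it
# dates from Sep 2018 onwards.
is_crawl_id() {
    case "$1" in
        CC-MAIN-[0-9][0-9][0-9][0-9]-[0-9][0-9]) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example: is_crawl_id "$1" || { echo "expected CC-MAIN-YYYY-##" >&2; exit 1; }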
371
372
373	2. Grab our 2nd bash script and put it into the top level of cc-index-table (/home/vagrant/cc-index-table):
374
375	cd cc-index-table
376	wget http://svn.greenstone.org/gs3-extensions/maori-lang-detection/hdfs-cc-work/scripts/get_Maori_WET_records_from_CCSep2018_on.sh
377	chmod u+x get_Maori_WET_records_from_CCSep2018_on.sh
378
379	RUN FROM cc-index-table DIRECTORY AS:
380	(cd cc-index-table)
381	./get_Maori_WET_records_from_CCSep2018_on.sh
382
383	This script just runs the 1st script cc-index-table/src/script/get_maori_WET_records_for_crawl.sh (above) to process all listed common-crawls since September 2018.
384	If any fails, then the script will terminate. Else it runs against each common-crawl in sequence.
385
386	NOTE: If needed, update the script with more recent crawl timestamps from http://index.commoncrawl.org/
387
388	OUTPUT:
389	After days of running, will end up with:
390	hdfs:///user/vagrant/<crawl-timestamp>/wet/
391	for each crawl-timestamp listed in the script,
392	which at present would have got copied into
393	/vagrant/<crawl-timestamp>-wet-files/
394
395	Each of these output wet folders can then be processed in turn by CCWETProcessor.java from http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java
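The control flow described for the 2nd script (process each crawl in sequence, terminate on the first failure) can be sketched as below. This is an illustration of the pattern, not a copy of get_Maori_WET_records_from_CCSep2018_on.sh; the function name is hypothetical and the command to run is whatever you pass in.

```shell
#!/usr/bin/env bash
# Sketch of the control flow only: run CMD once per crawl id, in order,
# and stop at the first failure.
run_for_each_crawl() {
    local cmd="$1" id
    shift
    for id in "$@"; do
        echo "Processing $id"
        if ! "$cmd" "$id"; then
            echo "Failed on $id, stopping" >&2
            return 1
        fi
    done
}
```

From the cc-index-table directory it could be used as:
run_for_each_crawl ./src/script/get_maori_WET_records_for_crawl.sh <crawl-timestamp-1> <crawl-timestamp-2>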
396
397	-----------------------------------
398	H. Autistici's crawl
399	-----------------------------------
400	Autistici's crawl: CLI to download web sites as WARCs, features basics to avoid crawler traps.
401
402	Of the several tools available for site mirroring, Autistici's "crawl" seemed promising:
403	https://anarc.at/services/archive/web/
404
405	- CLI.
406	- Can download a website quite simply, though flags for additional settings are available.
407	- Coded to prevent common traps.
408	- Downloads website as WARC file
409	- Now I have the WARC to WET process working for the WARC file it produced for the usual test site (Dr Bainbridge's home page)
410
411	Need to have Go installed in order to install and run Autistici's crawl.
412	Not a problem, because I can do it on the remote machine (which also hosts the hdfs) where I have sudo powers.
413
414	INSTRUCTIONS
415
416	1. Install go 1.11 by following instructions at https://medium.com/better-programming/install-go-1-11-on-ubuntu-18-04-16-04-lts-8c098c503c5f
417	2. Create go environment:
418	#!/bin/bash
419	# environment vars for golang
420	export GOROOT=/usr/local/go
421	export GOPATH=$HOME/go
422	export PATH=$GOPATH/bin:$GOROOT/bin:$PATH
423	3. The https://git.autistici.org/ale/crawl/README.md instructions on installing are not very clear and don't work as is at this stage.
424
425	These steps work:
426
427	cd $GOPATH
428	mkdir bin
429	mkdir src
430	cd src
431
432	4. Trying to "go install" the crawl URL directly didn't work (see the link below), so clone the repository manually under $GOPATH/src instead:
433	https://stackoverflow.com/questions/14416275/error-cant-load-package-package-my-prog-found-packages-my-prog-and-main
434
435	vagrant@node2:~/go/src$
436	mkdir -p git.autistici.org/ale
437	cd git.autistici.org/ale
438	git clone https://git.autistici.org/ale/crawl.git
439
440	[Now can run the install command in README.md:]
441	cd $GOPATH/src
442	go install git.autistici.org/ale/crawl/cmd/crawl
443
444	Now we should have a $GOPATH/bin folder containing the "crawl" binary
445
446	5. Run a crawl:
447	cd $GOPATH/bin
448	./crawl https://www.cs.waikato.ac.nz/~davidb/
449
450	which downloads the site and puts the warc file into the $GOPATH/bin folder.
451
452	More options are described in the instructions in README.md, including the output folder and a WARC filename pattern for huge sites, so that multiple warc files created for one site follow the same pattern.
453
454	6. To view the RAW contents of a WARC file:
455	https://github.com/ArchiveTeam/grab-site/blob/master/README.md#viewing-the-content-in-your-warc-archives
456
457	zless <warc-file-name>
458
459	zless is already installed on the vagrant VM
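Paging through a WARC with zless shows everything; for a quick overview it can be more useful to count the records of each type. The sketch below does that with standard tools. The helper name is made up, and the approach is a simple text match on header lines, so a payload line that happens to start with "WARC-Type:" would also be counted.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count the records of each WARC-Type in a gzipped
# WARC (or WET) file. WARC header lines end in CRLF, so the CR is stripped.
# grep -a forces text mode, since record payloads may be binary.
warc_type_counts() {
    gzip -dc < "$1" | tr -d '\r' | grep -a '^WARC-Type:' | sort | uniq -c
}
```

For example, "warc_type_counts crawl.warc.gz" prints one line per record type with its count.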
460
461
462	-----------------------------------------------------------------------------------------------
463	How to run warc-to-wet conversion on sites downloaded as WARCs by Autistici's "crawl"
464	-----------------------------------------------------------------------------------------------
465	ISSUES CONVERTING WARC to WET:
466	---
467	WARC files produced by Autistici crawl are of a somewhat different format to CommonCrawl WARCs.
468	- missing elements in header
469	- different header elements
470	- ordering different (if that matters)
471
472	But WET is an official format, not CommonCrawl specific, as indicated by
473
474	https://library.stanford.edu/projects/web-archiving/research-resources/data-formats-and-apis
475	"WET (parsed text)
476
477	WARC Encapsulated Text (WET) or parsed text consists of the extracted plaintext, delimited by archived document. Each record retains the associated URL and timestamp. Common Crawl provides details on the format and Internet Archive provides documentation on usage, though they use different names for the format."
478
479	So must be possible to get WARC to WET conversion used for CommonCrawl data to work on Autistici crawl's WARC files.
480
481
482	RESOLUTION:
483	---
484	I made changes to 2 java source files in the 2 github projects ia-web-commons and ia-hadoop-tools, which we use for the WARC to WET processing of CommonCrawl data. These gitprojects (with modifications for commoncrawl) are already on http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/hdfs-cc-work/gitprojects.
485
486	The changed files are as follows:
487	1. patches/WATExtractorOutput.java
488	put into ia-web-commons/src/main/java/org/archive/extract
489	after renaming existing to .orig
490
491	THEN RECOMPILE ia-web-commons with:
492	mvn install
493
494	2. patches/GZRangeClient.java
495	put into ia-hadoop-tools/src/main/java/org/archive/server
496	after renaming existing to .orig
497
498	THEN RECOMPILE ia-hadoop-tools with:
499	mvn package
500
501	Make sure to first compile ia-web-commons, then ia-hadoop-tools.
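The "rename existing to .orig, then put the patched file in place" step is the same for both files, so it can be wrapped in one small function. This is a sketch with a hypothetical helper name; it follows the naming used in the diffs in this README (for example GZRangeClient.java is kept as GZRangeClient.orig) and leaves an existing .orig untouched, so that running it twice does not overwrite the true original.

```shell
#!/usr/bin/env bash
# Hypothetical helper: install a patched .java file into DEST_DIR, keeping
# the existing file as <name>.orig (the .java suffix is replaced).
install_patched_file() {
    local patch="$1" dest_dir="$2"
    local name base
    name="$(basename "$patch")"
    base="${name%.java}"
    # Only back up if there is a file to back up and no backup exists yet.
    if [ -f "$dest_dir/$name" ] && [ ! -f "$dest_dir/$base.orig" ]; then
        mv "$dest_dir/$name" "$dest_dir/$base.orig"
    fi
    cp "$patch" "$dest_dir/$name"
}
```

Usage, following the two steps above:
install_patched_file patches/WATExtractorOutput.java ia-web-commons/src/main/java/org/archive/extract
install_patched_file patches/GZRangeClient.java ia-hadoop-tools/src/main/java/org/archive/server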
502
503
504	The modifications made to the above 2 files are as follows:
505	>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
506	1. ia-web-commons/src/main/java/org/archive/extract/WATExtractorOutput.java
507
508	[diff src/main/java/org/archive/extract/WATExtractorOutput.orig src/main/java/org/archive/extract/WATExtractorOutput.java]
509
510	162,163c162,163
511	< targetURI = extractOrIO(md, "Envelope.WARC-Header-Metadata.WARC-Filename");
512	< } else {
513	---
514	> targetURI = extractOrIO(md, "Envelope.WARC-Header-Metadata.WARC-Warcinfo-ID");
515	> } else {
516
517
518	2. ia-hadoop-tools/src/main/java/org/archive/server/GZRangeClient.java
519
520	[diff src/main/java/org/archive/server/GZRangeClient.orig src/main/java/org/archive/server/GZRangeClient.java]
521
522	76,83c76,82
523	< "WARC/1.0\r\n" +
524	< "WARC-Type: warcinfo\r\n" +
525	< "WARC-Date: %s\r\n" +
526	< "WARC-Filename: %s\r\n" +
527	< "WARC-Record-ID: <urn:uuid:%s>\r\n" +
528	< "Content-Type: application/warc-fields\r\n" +
529	< "Content-Length: %d\r\n\r\n";
530	<
531	---
532	> "WARC/1.0\r\n" +
533	> "Content-Type: application/warc-fields\r\n" +
534	> "WARC-Type: warcinfo\r\n" +
535	> "WARC-Warcinfo-ID: <urn:uuid:%s>\r\n" +
536	> "Content-Length: %d\r\n\r\n" +
537	> "WARC-Record-ID: <urn:uuid:%s>\r\n" +
538	> "WARC-Date: %s\r\n";
539	115,119c114,119
540	< private static String DEFAULT_WARC_PATTERN = "software: %s Extractor\r\n" +
541	< "format: WARC File Format 1.0\r\n" +
542	< "conformsTo: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf\r\n" +
543	< "publisher: Internet Archive\r\n" +
544	< "created: %s\r\n\r\n";
545	---
546	> private static String DEFAULT_WARC_PATTERN = "Software: crawl/1.0\r\n" +
547	> "Format: WARC File Format 1.0\r\n" +
548	> "Conformsto: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf\r\n\r\n";
549	> // +
550	> //"publisher: Internet Archive\r\n" +
551	> //"created: %s\r\n\r\n";
552	<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
553
554
555	3. To run WARC to WET, the warc needs to live on hdfs in a warc folder and there should be wet and wat folders at the same level.
556
557	For example, assume that running Autistici's crawl generated $GOPATH/bin/crawl.warc.gz
558	(default location and filename unless you pass flags to crawl CLI to control these)
559
560	a. Ensure you get crawl.warc.gz onto the vagrant VM with the WARC to WET git projects installed, recompiled with the above modifications.
561
562	b. Now, create the folder structure needed for warc-to-wet conversion:
563	hdfs dfs -mkdir /user/vagrant/warctest
564	hdfs dfs -mkdir /user/vagrant/warctest/warc
565	hdfs dfs -mkdir /user/vagrant/warctest/wet
566	hdfs dfs -mkdir /user/vagrant/warctest/wat
567
568	c. Put crawl.warc.gz into the warc folder on hdfs:
569	hdfs dfs -put crawl.warc.gz /user/vagrant/warctest/warc/.
570
571	d. Finally, time to run the actual warc-to-wet conversion from ia-hadoop-tools:
572	cd ia-hadoop-tools
573	WARC_FOLDER=/user/vagrant/warctest/warc
574	$HADOOP_MAPRED_HOME/bin/hadoop jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -strictMode -skipExisting batch-id-xyz $WARC_FOLDER/crawl*.warc.gz
575
576	More meaningful when the WARC_FOLDER contains multiple *.warc.gz files,
577	as the above will use map-reduce to generate the *.warc.wet.gz files in the output wet folder.
578
579	e. Copy the generated wet files across from /user/vagrant/warctest/wet/:
580
581	(cd /vagrant or else
582	cd /home/vagrant
583	)
584	hdfs dfs -get /user/vagrant/warctest/wet/crawl.warc.wet.gz .
585
586	or, when dealing with multiple input warc files, we'll have multiple wet files:
587	hdfs dfs -get /user/vagrant/warctest/wet/*.warc.wet.gz
588
589
590	f. Now can view the contents of the WET files to confirm they are what we want:
591	gunzip crawl.warc.wet.gz
592	zless crawl.warc.wet
593
594	The wet file contents should look good now: the web pages as WET records without html tags.
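Steps b to e above can be collected into a checklist for a given WARC file and HDFS folder. The sketch below only prints the commands, it does not run them or contact HDFS. The function name is made up, "hdfs dfs -mkdir -p" with several paths is swapped in for the separate mkdir calls in step b, and paths containing spaces are not handled.

```shell
#!/usr/bin/env bash
# Sketch: print (not run) the commands for steps b to e, for a local WARC
# file and an HDFS base folder. The hadoop jar line is meant to be run
# from inside the ia-hadoop-tools folder, as in step d.
print_warc_to_wet_steps() {
    local warc="$1" base="$2"
    local name
    name="$(basename "$warc")"
    echo "hdfs dfs -mkdir -p $base/warc $base/wet $base/wat"
    echo "hdfs dfs -put $warc $base/warc/."
    echo "\$HADOOP_MAPRED_HOME/bin/hadoop jar \$PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -strictMode -skipExisting batch-id-xyz $base/warc/*.warc.gz"
    echo "hdfs dfs -get $base/wet/${name%.warc.gz}.warc.wet.gz ."
}
```

For the example in this section: print_warc_to_wet_steps crawl.warc.gz /user/vagrant/warctest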
595
596
597	-----------------------EOF------------------------
598


source: gs3-extensions/maori-lang-detection/hdfs-cc-work/GS_README.TXT@ 33541