1----------------------------------------
2INDEX: follow in sequence
3----------------------------------------
4A. VAGRANT VM WITH HADOOP AND SPARK
5B. Create IAM role on Amazon AWS to use S3a
6C. Configure Spark on your vagrant VM with the AWS authentication details
7---
8The script scripts/setup.sh automates steps D-F below
9and prints out the main instruction for G.
10---
11D. OPTIONAL? Further configuration for Hadoop to work with Amazon AWS
12E. Setup cc-index-table git project
13F. Setup warc-to-wet tools (git projects)
14G. Getting and running our scripts
15---
16H. Autistici's crawl - CLI to download web sites as WARCs, with basic features to avoid crawler traps
17I. Setting up Nutch v2 on its own Vagrant VM machine
18J. Automated crawling with Nutch v2.3.1 and post-processing
19
20----------------------------------------
21
22----------------------------------------
23A. VAGRANT VM WITH HADOOP AND SPARK
24----------------------------------------
25Set up vagrant with hadoop and spark as follows:
26
271. Follow the instructions at
28https://github.com/martinprobson/vagrant-hadoop-hive-spark
29
30This will eventually create the following folder, which will contain Vagrantfile
31/home/<USER>/vagrant-hadoop-hive-spark
32
332. If there are other vagrant VMs set up according to the same instructions on the same machine, then you need to change the forwarded ports (the 2nd column of ports) in the file "Vagrantfile". In the example below, excerpted from my Vagrantfile, I've incremented the forwarded ports by 1:
34
35 config.vm.network "forwarded_port", guest: 8080, host: 8081
36 config.vm.network "forwarded_port", guest: 8088, host: 8089
37 config.vm.network "forwarded_port", guest: 9083, host: 9084
38 config.vm.network "forwarded_port", guest: 4040, host: 4041
39 config.vm.network "forwarded_port", guest: 18888, host: 18889
40 config.vm.network "forwarded_port", guest: 16010, host: 16011
41
42Remember to visit the adjusted ports on the running VM.
43
443. The most useful vagrant commands:
45vagrant up # start up the vagrant VM if not already running.
46 # May need to provide VM's ID if there's more than one vagrant VM
47vagrant ssh # ssh into the sole vagrant VM, else provide the vagrant VM's ID (see the example below)
48
49vagrant halt # to shutdown the vagrant VM. Provide VM's ID if there's more than one vagrant VM.
50
51(vagrant destroy) # to get rid of your vagrant VM. Useful if you've edited your Vagrantfile
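
For example, when more than one vagrant VM is defined on the host, you can list them and target one by name (a sketch; "node1" stands for whatever name "vagrant status" reports):

    vagrant status          # list the VMs defined by this Vagrantfile and their state
    vagrant global-status   # list all vagrant VMs known on this host machine
    vagrant up node1
    vagrant ssh node1
    vagrant halt node1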
52
53
544. Inside the VM, /home/<USER>/vagrant-hadoop-hive-spark will be shared and mounted as /vagrant
55Remember, this is the folder containing Vagrantfile. It's easy to use the shared folder to transfer files between the VM and the actual machine that hosts it.
56
575. Install EMACS, FIREFOX AND MAVEN on the vagrant VM:
58Start up the vagrant machine ("vagrant up") and ssh into it ("vagrant ssh") if you haven't already.
59
60
61a. sudo apt-get -y install firefox
62
63b. sudo apt-get install emacs
64
65c. sudo apt-get install maven
66 (or sudo apt update
67 sudo apt install maven)
68
69Maven is needed for the commoncrawl github projects we'll be working with.
70
71
726. Although you can edit the Vagrantfile to have emacs and maven installed automatically when the vagrant VM is created (see the provisioning sketch below), firefox is best installed manually as above.
73
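A minimal sketch of such a provisioning block, using Vagrant's standard inline shell provisioner (add it inside the config block of your Vagrantfile; package names as installed manually above):

    config.vm.provision "shell", inline: <<-SHELL
      apt-get update
      apt-get -y install emacs maven
    SHELL
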
74To be able to view firefox from the machine hosting the VM, use a separate terminal and run:
75 vagrant ssh -- -Y
76[or "vagrant ssh -- -Y node1", if VM ID is node1]
77
78READING ON Vagrant:
79 * Guide: https://www.vagrantup.com/intro/getting-started/index.html
80 * Common cmds: https://blog.ipswitch.com/5-vagrant-commands-you-need-to-know
81 * vagrant reload = vagrant halt + vagrant up https://www.vagrantup.com/docs/cli/reload.html
82 * https://stackoverflow.com/questions/46903623/how-to-use-firefox-ui-in-vagrant-box
83 * https://stackoverflow.com/questions/22651399/how-to-install-firefox-in-precise64-vagrant-box
84 sudo apt-get -y install firefox
85 * vagrant install emacs: https://medium.com/@AnnaJS15/getting-started-with-virtualbox-and-vagrant-8d98aa271d2a
86
87 * hadoop conf: sudo vi /usr/local/hadoop-2.7.6/etc/hadoop/mapred-site.xml
88 * https://data-flair.training/forums/topic/mkdir-cannot-create-directory-data-name-node-is-in-safe-mode/
89
90-------------------------------------------------
91B. Create IAM role on Amazon AWS to use S3 (S3a)
92-------------------------------------------------
93CommonCrawl (CC) crawl data is stored on Amazon S3 and is accessed via s3a, the newest S3 filesystem connector, which has superseded both the original s3 connector and its successor s3n.
94
95In order to have access to CC crawl data, you need to create an IAM role on Dr Bainbridge's Amazon AWS account and configure its profile for commoncrawl.
96
971. Log into Dr Bainbridge's Amazon AWS account
98- In the aws management console:
99[email protected]
100lab pwd, capital R and ! (maybe g)
101
102
1032. Create a new "iam" role or user for "commoncrawl(er)" profile
104
1053. You can create the commoncrawl profile while creating the user/role, by following the instructions at https://answers.dataiku.com/1734/common-crawl-s3
106which states
107
108"Even though the bucket is public, if your AWS key does not have your full permissions (ie if it's a restricted IAM user), you need to grant explicit access to the commoncrawl bucket: attach the following policy to your IAM user"
109
110#### START POLICY IN JSON FORMAT ###
111{
112 "Version": "2012-10-17",
113 "Statement": [
114 {
115 "Sid": "Stmt1503647467000",
116 "Effect": "Allow",
117 "Action": [
118 "s3:GetObject",
119 "s3:ListBucket"
120 ],
121 "Resource": [
122 "arn:aws:s3:::commoncrawl/*",
123 "arn:aws:s3:::commoncrawl"
124 ]
125 }
126 ]
127}
128#### END POLICY ###
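
If you prefer the AWS CLI over the web console, the same policy can be attached to the IAM user with something like the following (a sketch; the user name and policy name here are placeholders, substitute your own):

    # save the JSON policy above as commoncrawl-policy.json, then:
    aws iam put-user-policy --user-name commoncrawl \
        --policy-name CommonCrawlReadAccess \
        --policy-document file://commoncrawl-policy.json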
129
130
131--------------------------------------------------------------------------
132C. Configure Spark on your vagrant VM with the AWS authentication details
133--------------------------------------------------------------------------
134Any Spark jobs run against the CommonCrawl data stored on Amazon s3a need to be able to authenticate with the AWS IAM role you created above. To do this, put the Amazon AWS access key and secret key in the Spark configuration properties file rather than in hadoop's core-site.xml: in the latter case the authentication details don't get copied across when distributed jobs are run on other computers in the cluster, which also need to know how to authenticate.
135
1361. Inside the vagrant vm:
137
138 sudo emacs /usr/local/spark-2.3.0-bin-hadoop2.7/conf/spark-defaults.conf
139 (sudo emacs $SPARK_HOME/conf/spark-defaults.conf)
140
1412. Edit the spark properties conf file to contain these 3 new properties:
142
143 spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
144 spark.hadoop.fs.s3a.access.key=PASTE_IAM-ROLE_ACCESSKEY_HERE
145 spark.hadoop.fs.s3a.secret.key=PASTE_IAM-ROLE_SECRETKEY_HERE
146
147Instructions on which properties to set were taken from:
148- https://stackoverflow.com/questions/30385981/how-to-access-s3a-files-from-apache-spark
149- https://community.cloudera.com/t5/Community-Articles/HDP-2-4-0-and-Spark-1-6-0-connecting-to-AWS-S3-buckets/ta-p/245760
150
151[NOTE, inactive alternative: Instead of editing spark's config file to set these properties, these properties can also be set in the bash script that executes the commoncrawl Spark jobs:
152
153$SPARK_HOME/bin/spark-submit \
154 ...
155 --conf spark.hadoop.fs.s3a.access.key=ACCESSKEY \
156 --conf spark.hadoop.fs.s3a.secret.key=SECRETKEY \
157 --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
158 ...
159
160But it's better not to hardcode authentication details into scripts, so I did it the first way.
161]
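
Once the keys are configured (and the hadoop-aws/aws-sdk jars from section D are on the classpath), one quick way to sanity-check s3a access is to list the public commoncrawl bucket with the hadoop CLI, passing the same properties as generic -D options (a sketch; substitute the real keys):

    hadoop fs -D fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
              -D fs.s3a.access.key=ACCESSKEY \
              -D fs.s3a.secret.key=SECRETKEY \
              -ls s3a://commoncrawl/ | head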
162
163----------------------------------------------------------------------
164NOTE:
165The script scripts/setup.sh automates steps D-F below
166and prints out the main instruction for G.
167
168
169----------------------------------------------------------------------
170D. OPTIONAL? Further configuration for Hadoop to work with Amazon AWS
171----------------------------------------------------------------------
172The following 2 pages state that additional steps are necessary to get hadoop and spark to work with AWS S3a:
173
174- https://stackoverflow.com/questions/30385981/how-to-access-s3a-files-from-apache-spark
175- https://community.cloudera.com/t5/Community-Articles/HDP-2-4-0-and-Spark-1-6-0-connecting-to-AWS-S3-buckets/ta-p/245760
176
177I'm not sure whether these steps were really necessary in my case, and if so, whether it was A or B below that got things working for me. However, I have both A and B below set up.
178
179
180A. Check your maven installation for necessary jars:
181
1821. Installing maven may already have got the specifically recommended version of AWS-Java-SDK (aws-java-sdk-1.7.4.jar) and v2.7.6 hadoop-aws matching the vagrant VM's hadoop version (hadoop-aws-2.7.6.jar). Check these locations, as that's where I have them:
183- /home/vagrant/.m2/repository/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar
184- /home/vagrant/.m2/repository/org/apache/hadoop/hadoop-aws/2.7.6/hadoop-aws-2.7.6.jar
185
186The specifically recommended v.1.7.4 from the instructions can be found off https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk/1.7.4 at https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar
187
1882. The script that runs the 2 Spark jobs uses the above paths for one of the spark jobs:
189 $SPARK_HOME/bin/spark-submit \
190 --jars file:/home/vagrant/.m2/repository/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar,file:/home/vagrant/.m2/repository/org/apache/hadoop/hadoop-aws/2.7.6/hadoop-aws-2.7.6.jar \
191 --driver-class-path=/home/vagrant/.m2/repository/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar:/home/vagrant/.m2/repository/org/apache/hadoop/hadoop-aws/2.7.6/hadoop-aws-2.7.6.jar \
192
193However, the other Spark job in the script does not set --jars or --driver-class-path, despite also referring to the s3a://commoncrawl table. So I'm not sure whether the jars are necessary or whether they were just being ignored when provided.
194
195B. Download jar files and put them on the hadoop classpath:
196
1971. download the jar files:
198- I obtained aws-java-sdk-1.11.616.jar (v1.11) from https://aws.amazon.com/sdk-for-java/
199
200- I downloaded hadoop-aws 2.7.6 jar, as it goes with my version of hadoop, from https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/2.7.6
201
2022. The easiest solution is to copy the 2 downloaded jars onto a location in the hadoop classpath.
203
204a. The command that shows the paths present on the Hadoop CLASSPATH:
205 hadoop classpath
206One of the paths this will list is /usr/local/hadoop-2.7.6/share/hadoop/common/
207
208b. SUDO COPY the 2 downloaded jar files, hadoop-aws-2.7.6.jar and aws-java-sdk-1.11.616.jar, to this location:
209
210 sudo cp hadoop-aws-2.7.6.jar /usr/local/hadoop-2.7.6/share/hadoop/common/.
211 sudo cp aws-java-sdk-1.11.616.jar /usr/local/hadoop-2.7.6/share/hadoop/common/.
212
213Any hadoop jobs run will now find these 2 jar files on the classpath.
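
To confirm the two jars are now visible to hadoop, you can simply list that directory (path as above for hadoop 2.7.6):

    ls /usr/local/hadoop-2.7.6/share/hadoop/common/ | grep -i aws
    # expect to see aws-java-sdk-1.11.616.jar and hadoop-aws-2.7.6.jar listed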
214
215[NOTE, unused alternative: Instead of copying the 2 jar files into a system location, assuming they were downloaded into /home/vagrant/lib, you can also export a custom folder's jar files onto the hadoop classpath from the bash script that runs the spark jobs. This had no effect for me and remains commented out, which is another reason why I'm not sure whether the 2 jar files were even necessary.
216#export LIBJARS=/home/vagrant/lib/*
217#export HADOOP_CLASSPATH=`echo ${LIBJARS} | sed s/,/:/g`
218]
219
220
221------------------------------------
222E. Setup cc-index-table git project
223------------------------------------
224You need to be inside the vagrant VM for these steps.
225
2261. Since you should already have installed maven, you can check out and compile the cc-index-table git project.
227
228 git clone https://github.com/commoncrawl/cc-index-table.git
229
2302. Modify the toplevel pom.xml file used by maven by changing the spark version used to 2.3.0 and adding a dependency for hadoop-aws 2.7.6, as indicated below:
231
23217c17,18
233< <spark.version>2.4.1</spark.version>
234---
235> <!--<spark.version>2.4.1</spark.version>-->
236> <spark.version>2.3.0</spark.version>
237135a137,143
238> <dependency>
239> <groupId>org.apache.hadoop</groupId>
240> <artifactId>hadoop-aws</artifactId>
241> <version>2.7.6</version>
242> </dependency>
243>
244
2453. Although cc-index-table will compile successfully after the above modifications, it will nevertheless throw an exception when it's eventually run. To fix that, edit the file "cc-index-table/src/main/java/org/commoncrawl/spark/examples/CCIndexWarcExport.java" as follows:
246
247a. Set option(header) to false, since the csv file contains no header row, only data rows.
248 Change:
249 sqlDF = sparkSession.read().format("csv").option("header", true).option("inferSchema", true)
250 .load(csvQueryResult);
251 To
252 sqlDF = sparkSession.read().format("csv").option("header", false).option("inferSchema", true)
253 .load(csvQueryResult);
254
255b. The 4 column names are inferred as _c0 to _c3, not as url/warc_filename etc.
256 Comment out:
257 //JavaRDD<Row> rdd = sqlDF.select("url", "warc_filename", "warc_record_offset", "warc_record_length").rdd()
258 .toJavaRDD();
259 Replace with the default inferred column names:
260 JavaRDD<Row> rdd = sqlDF.select("_c0", "_c1", "_c2", "_c3").rdd()
261 .toJavaRDD();
262
263// TODO: link svn committed versions of orig and modified CCIndexWarcExport.java here.
264
2654. Now (re)compile cc-index-table with the above modifications:
266
267 cd cc-index-table
268 mvn package
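
A couple of quick sanity checks after the build (a sketch):

    $SPARK_HOME/bin/spark-submit --version   # should report 2.3.0, matching the spark.version set in pom.xml above
    ls target/                               # the assembled cc-index-table job jar should now be here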
269
270-------------------------------
271F. Setup warc-to-wet tools
272-------------------------------
273To convert WARC files to WET (.warc.wet) files, you need to check out, set up and compile a couple more tools. These instructions are derived from those at https://groups.google.com/forum/#!topic/common-crawl/hsb90GHq6to
274
2751. Grab and compile the 2 git projects for converting warc to wet:
276 git clone https://github.com/commoncrawl/ia-web-commons
277 cd ia-web-commons
278 mvn install
279
280 git clone https://github.com/commoncrawl/ia-hadoop-tools
281 cd ia-hadoop-tools
282 # can't compile this yet
283
284
2852. Add the following into ia-hadoop-tools/pom.xml, in the top of the <dependencies> element, so that maven finds an appropriate version of the org.json package and its JSONTokener (version number found off https://mvnrepository.com/artifact/org.json/json):
286
287 <dependency>
288 <groupId>org.json</groupId>
289 <artifactId>json</artifactId>
290 <version>20131018</version>
291 </dependency>
292
293[
 294 UNFAMILIAR CHANGES that I don't recollect making and that may have been a consequence of the change in step 1 above:
295 a. These further differences show up between the original version of the file in pom.xml.orig and the modified new pom.xml:
296 ia-hadoop-tools>diff pom.xml.orig pom.xml
297
298 < <groupId>org.netpreserve.commons</groupId>
299 < <artifactId>webarchive-commons</artifactId>
300 < <version>1.1.1-SNAPSHOT</version>
301 ---
302 > <groupId>org.commoncrawl</groupId>
303 > <artifactId>ia-web-commons</artifactId>
304 > <version>1.1.9-SNAPSHOT</version>
305
306 b. I don't recollect changing or manually introducing any java files. I just followed the instructions at https://groups.google.com/forum/#!topic/common-crawl/hsb90GHq6to
307
 308 However, a diff -rq between the latest "ia-hadoop-tools" git project (checked out a month after the "ia-hadoop-tools.orig" checkout I ran) shows the following differences, in files which github itself does not show as modified in that same period.
309
310 ia-hadoop-tools> diff -rq ia-hadoop-tools ia-hadoop-tools.orig/
311 Files ia-hadoop-tools/src/main/java/org/archive/hadoop/io/HDFSTouch.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/io/HDFSTouch.java differ
312 Only in ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs: ArchiveFileExtractor.java
313 Files ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/CDXGenerator.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs/CDXGenerator.java differ
314 Files ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/JobDriver.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs/JobDriver.java differ
315 Files ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/WARCMetadataRecordGenerator.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs/WARCMetadataRecordGenerator.java differ
316 Files ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/WATGenerator.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs/WATGenerator.java differ
317 Only in ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs: WEATGenerator.java
318 Files ia-hadoop-tools/src/main/java/org/archive/hadoop/mapreduce/ZipNumPartitioner.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/mapreduce/ZipNumPartitioner.java differ
319 Files ia-hadoop-tools/src/main/java/org/archive/hadoop/pig/ZipNumRecordReader.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/pig/ZipNumRecordReader.java differ
320 Files ia-hadoop-tools/src/main/java/org/archive/hadoop/streaming/ZipNumRecordReader.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/streaming/ZipNumRecordReader.java differ
321 Files ia-hadoop-tools/src/main/java/org/archive/server/FileBackedInputStream.java and ia-hadoop-tools.orig/src/main/java/org/archive/server/FileBackedInputStream.java differ
322 Files ia-hadoop-tools/src/main/java/org/archive/server/GZRangeClient.java and ia-hadoop-tools.orig/src/main/java/org/archive/server/GZRangeClient.java differ
323 Files ia-hadoop-tools/src/main/java/org/archive/server/GZRangeServer.java and ia-hadoop-tools.orig/src/main/java/org/archive/server/GZRangeServer.java differ
324]
325
3263. Now can compile ia-hadoop-tools:
327 cd ia-hadoop-tools
328 mvn package
329
3304. You can't run it until guava.jar is on the hadoop classpath. Locate a guava.jar and put it into one of the locations checked by "hadoop classpath":
331
332 locate guava.jar
333 # found in /usr/share/java/guava.jar and /usr/share/maven/lib/guava.jar
334 diff /usr/share/java/guava.jar /usr/share/maven/lib/guava.jar
335 # identical/no difference, so can use either
336 sudo cp /usr/share/java/guava.jar /usr/local/hadoop/share/hadoop/common/.
337 # now guava.jar has been copied into a location on hadoop classpath
338
339
340Having done the above, our bash script will now be able to convert WARC to WET files when it runs:
341 $HADOOP_MAPRED_HOME/bin/hadoop jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -strictMode -skipExisting batch-id-xyz hdfs:///user/vagrant/PATH/TO/warc/*.warc.gz
342Our script expects a specific folder structure: there should be a "warc" folder (containing the warc files), which is what gets supplied on the command line above, plus initially empty "wet" and "wat" folders at the same level as the "warc" folder (see the sketch below).
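
For example, a hypothetical batch called "myset" would be laid out on hdfs like this before running the conversion:

    hdfs dfs -mkdir -p /user/vagrant/myset/warc /user/vagrant/myset/wet /user/vagrant/myset/wat
    hdfs dfs -put *.warc.gz /user/vagrant/myset/warc/
    # then pass hdfs:///user/vagrant/myset/warc/*.warc.gz to the WEATGenerator command above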
343
344
345When the job is running, you can visit the Spark context at http://node1:4040/jobs/ (for me it was http://node1:4041/jobs/ the first time, since I had forwarded the vagrant VM's ports at +1; on subsequent runs, however, it was at node1:4040/jobs).
346
347-----------------------------------
348G. Getting and running our scripts
349-----------------------------------
350
3511. Grab our 1st bash script and put it into /home/vagrant/cc-index-table/src/script:
352 cd cc-index-table/src/script
353 wget http://svn.greenstone.org/gs3-extensions/maori-lang-detection/hdfs-cc-work/scripts/get_maori_WET_records_for_crawl.sh
354 chmod u+x get_maori_WET_records_for_crawl.sh
355
356RUN AS:
357cd cc-index-table
358./src/script/get_maori_WET_records_for_crawl.sh <crawl-timestamp>
 359 where crawl-timestamp is of the form "CC-MAIN-YYYY-##", from September 2018 onwards
360
361OUTPUT:
362After hours of processing (leave it to run overnight), you should end up with:
363 hdfs dfs -ls /user/vagrant/<crawl-timestamp>
364In particular, the zipped wet records at hdfs:///user/vagrant/<crawl-timestamp>/wet/
365that we want would have been copied into /vagrant/<crawl-timestamp>-wet-files/
366
367
368The script get_maori_WET_records_for_crawl.sh
369- takes a crawl timestamp of the form "CC-MAIN-YYYY-##" from Sep 2018 onwards (before which content_languages were not indexed). The legitimate crawl timestamps are listed in the first column at http://index.commoncrawl.org/
370- runs a spark job against CC's AWS bucket over s3a to create a csv table of MRI language records (see the query sketch below)
371- runs a spark job to download all the WARC records from CC's AWS that are denoted by the csv file's records into zipped warc files
372- converts WARC to WET: locally converts the downloaded warc.gz files into warc.wet.gz (and warc.wat.gz) files
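
For reference, the heart of the first spark job is a SQL query over CC's columnar index restricted to Māori ("mri") content. A sketch of what that query looks like (illustrative only; the crawl timestamp is just an example, and the authoritative query with its exact filters lives inside get_maori_WET_records_for_crawl.sh):

    # column names as used by CC's columnar index / CCIndexWarcExport above
    QUERY="SELECT url, warc_filename, warc_record_offset, warc_record_length
           FROM ccindex
           WHERE crawl = 'CC-MAIN-2019-35' AND subset = 'warc' AND content_languages = 'mri'"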
373
374
3752. Grab our 2nd bash script and put it into the top level of cc-index-table (/home/vagrant/cc-index-table):
376
377 cd cc-index-table
378 wget http://svn.greenstone.org/gs3-extensions/maori-lang-detection/hdfs-cc-work/scripts/get_Maori_WET_records_from_CCSep2018_on.sh
379 chmod u+x get_Maori_WET_records_from_CCSep2018_on.sh
380
381RUN FROM cc-index-table DIRECTORY AS:
382 (cd cc-index-table)
383 ./get_Maori_WET_records_from_CCSep2018_on.sh
384
385This script just runs the 1st script cc-index-table/src/script/get_maori_WET_records_for_crawl.sh (above) to process all listed common-crawls since September 2018.
386If any fails, then the script will terminate. Else it runs against each common-crawl in sequence.
387
388NOTE: If needed, update the script with more recent crawl timestamps from http://index.commoncrawl.org/
389
390OUTPUT:
391After days of running, you will end up with:
392 hdfs:///user/vagrant/<crawl-timestamp>/wet/
393for each crawl-timestamp listed in the script,
394which at present would have got copied into
395 /vagrant/<crawl-timestamp>-wet-files/
396
397Each of these output wet folders can then be processed in turn by CCWETProcessor.java from http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java
398
399-----------------------------------
400H. Autistici crawl
401-----------------------------------
402Autistici's crawl: a CLI to download web sites as WARCs, with basic features to avoid crawler traps.
403
404Of the various software options for site mirroring, Autistici's "crawl" seemed the most promising:
405https://anarc.at/services/archive/web/
406
407- CLI.
408- Can download a website quite simply, though flags for additional settings are available.
409- Coded to prevent common traps.
410- Downloads website as WARC file
411- Now I have the WARC to WET process working for the WARC file it produced for the usual test site (Dr Bainbridge's home page)
412
413Need to have Go installed in order to install and run Autistici's crawl.
414Not a problem, because I can do it on the remote machine (which also hosts the hdfs) where I have sudo powers.
415
416INSTRUCTIONS
417
4181. Install go 1.11 by following instructions at https://medium.com/better-programming/install-go-1-11-on-ubuntu-18-04-16-04-lts-8c098c503c5f
4192. Create go environment:
420#!/bin/bash
421# environment vars for golang
422export GOROOT=/usr/local/go
423export GOPATH=$HOME/go
424export PATH=$GOPATH/bin:$GOROOT/bin:$PATH
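
For these settings to be picked up in every new shell, one option is to append them to ~/.bashrc (a sketch):

    cat >> ~/.bashrc <<'EOF'
    export GOROOT=/usr/local/go
    export GOPATH=$HOME/go
    export PATH=$GOPATH/bin:$GOROOT/bin:$PATH
    EOF
    source ~/.bashrc
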
4253. The https://git.autistici.org/ale/crawl/README.md instructions on installing are not very clear and don't work as is at this stage.
426
427These steps work:
428
429cd $GOPATH
430mkdir bin
431mkdir src
432cd src
433
4344. Since trying to "go install" the crawl URL directly didn't work (see the links below), clone it under $GOPATH/src instead:
435https://stackoverflow.com/questions/14416275/error-cant-load-package-package-my-prog-found-packages-my-prog-and-main
436[https://stackoverflow.com/questions/26694271/go-install-doesnt-create-any-bin-file]
437
438vagrant@node2:~/go/src$
439 mkdir -p git.autistici.org/ale
440 cd git.autistici.org/ale
441 git clone https://git.autistici.org/ale/crawl.git
442
443[Now can run the install command in README.md:]
444 cd $GOPATH/src
445 go install git.autistici.org/ale/crawl/cmd/crawl
446
447Now we should have a $GOPATH/bin folder containing the "crawl" binary
448
4495. Run a crawl:
450 cd $GOPATH/bin
451 ./crawl https://www.cs.waikato.ac.nz/~davidb/
452
453which downloads the site and puts the warc file into the $GOPATH/bin folder.
454
455More options (output folder, a WARC filename pattern for huge sites so that the multiple warc files created for one site follow the same naming pattern, etc.) are all covered in the instructions in README.md
456
4576. To view the RAW contents of a WARC file:
458https://github.com/ArchiveTeam/grab-site/blob/master/README.md#viewing-the-content-in-your-warc-archives
459
460zless <warc-file-name>
461
462zless is already installed on the vagrant VM
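
To just peek at the first record headers without paging through the whole file (standard tools only; crawl.warc.gz is the default output name mentioned later):

    zcat crawl.warc.gz | head -n 30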
463
464
465-----------------------------------------------------------------------------------------------
466How to run warc-to-wet conversion on sites downloaded as WARCs by Autistici's "crawl"
467-----------------------------------------------------------------------------------------------
468ISSUES CONVERTING WARC to WET:
469---
470WARC files produced by Autistici crawl are of a somewhat different format to CommonCrawl WARCs.
471- missing elements in header
472- different header elements
473- ordering different (if that matters)
474
475But WET is an official format, not CommonCrawl specific, as indicated by
476
477https://library.stanford.edu/projects/web-archiving/research-resources/data-formats-and-apis
478"WET (parsed text)
479
480WARC Encapsulated Text (WET) or parsed text consists of the extracted plaintext, delimited by archived document. Each record retains the associated URL and timestamp. Common Crawl provides details on the format and Internet Archive provides documentation on usage, though they use different names for the format."
481
482So it must be possible to get the WARC to WET conversion used for CommonCrawl data to work on Autistici crawl's WARC files.
483
484
485RESOLUTION:
486---
487I made changes to 2 java source files in the 2 github projects ia-web-commons and ia-hadoop-tools, which we use for the WARC to WET processing of CommonCrawl data. These gitprojects (with modifications for commoncrawl) are already on http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/hdfs-cc-work/gitprojects.
488
489The changed files are as follows:
4901. patches/WATExtractorOutput.java
491 put into ia-web-commons/src/main/java/org/archive/extract
492 after renaming existing to .orig
493
494THEN RECOMPILE ia-web-commons with:
495 mvn install
496
4972. patches/GZRangeClient.java
498 put into ia-hadoop-tools/src/main/java/org/archive/server
499 after renaming existing to .orig
500
501THEN RECOMPILE ia-hadoop-tools with:
502 mvn package
503
504Make sure to first compile ia-web-commons, then ia-hadoop-tools.
505
506
507The modifications made to the above 2 files are as follows:
508>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
5091. ia-web-commons/src/main/java/org/archive/extract/WATExtractorOutput.java
510
511[diff src/main/java/org/archive/extract/WATExtractorOutput.orig src/main/java/org/archive/extract/WATExtractorOutput.java]
512
513162,163c162,163
514< targetURI = extractOrIO(md, "Envelope.WARC-Header-Metadata.WARC-Filename");
515< } else {
516---
517> targetURI = extractOrIO(md, "Envelope.WARC-Header-Metadata.WARC-Warcinfo-ID");
518> } else {
519
520
5212. ia-hadoop-tools/src/main/java/org/archive/server/GZRangeClient.java
522
523[diff src/main/java/org/archive/server/GZRangeClient.orig src/main/java/org/archive/server/GZRangeClient.java]
524
52576,83c76,82
526< "WARC/1.0\r\n" +
527< "WARC-Type: warcinfo\r\n" +
528< "WARC-Date: %s\r\n" +
529< "WARC-Filename: %s\r\n" +
530< "WARC-Record-ID: <urn:uuid:%s>\r\n" +
531< "Content-Type: application/warc-fields\r\n" +
532< "Content-Length: %d\r\n\r\n";
533<
534---
535> "WARC/1.0\r\n" +
536> "Content-Type: application/warc-fields\r\n" +
537> "WARC-Type: warcinfo\r\n" +
538> "WARC-Warcinfo-ID: <urn:uuid:%s>\r\n" +
539> "Content-Length: %d\r\n\r\n" +
540> "WARC-Record-ID: <urn:uuid:%s>\r\n" +
541> "WARC-Date: %s\r\n";
542115,119c114,119
543< private static String DEFAULT_WARC_PATTERN = "software: %s Extractor\r\n" +
544< "format: WARC File Format 1.0\r\n" +
545< "conformsTo: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf\r\n" +
546< "publisher: Internet Archive\r\n" +
547< "created: %s\r\n\r\n";
548---
549> private static String DEFAULT_WARC_PATTERN = "Software: crawl/1.0\r\n" +
550> "Format: WARC File Format 1.0\r\n" +
551> "Conformsto: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf\r\n\r\n";
552> // +
553> //"publisher: Internet Archive\r\n" +
554> //"created: %s\r\n\r\n";
555<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
556
557
5583. To run WARC to WET, the warc needs to live on hdfs in a warc folder and there should be wet and wat folders at the same level.
559
560For example, assume that running Autistici's crawl generated $GOPATH/bin/crawl.warc.gz
561(default location and filename unless you pass flags to crawl CLI to control these)
562
563a. Ensure you get crawl.warc.gz onto the vagrant VM that has the git projects for WARC-to-WET installed, recompiled with the above modifications.
564
565b. Now, create the folder structure needed for warc-to-wet conversion:
566 hdfs dfs -mkdir /user/vagrant/warctest
567 hdfs dfs -mkdir /user/vagrant/warctest/warc
568 hdfs dfs -mkdir /user/vagrant/warctest/wet
569 hdfs dfs -mkdir /user/vagrant/warctest/wat
570
571c. Put crawl.warc.gz into the warc folder on hdfs:
572 hdfs dfs -put crawl.warc.gz /user/vagrant/warctest/warc/.
573
574d. Finally, time to run the actual warc-to-wet conversion from ia-hadoop-tools:
575 cd ia-hadoop-tools
576 WARC_FOLDER=/user/vagrant/warctest/warc
577 $HADOOP_MAPRED_HOME/bin/hadoop jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -strictMode -skipExisting batch-id-xyz $WARC_FOLDER/crawl*.warc.gz
578
579More meaningful when the WARC_FOLDER contains multiple *.warc.gz files,
580as the above will use map-reduce to generate a *.warc.wet.gz file in the output wet folder for each input *.warc.gz file.
581
582e. Copy the generated wet files across from /user/vagrant/warctest/wet/:
583
584 (cd /vagrant or else
585 cd /home/vagrant
586 )
587 hdfs dfs -get /user/vagrant/warctest/wet/crawl.warc.wet.gz .
588
589or, when dealing with multiple input warc files, we'll have multiple wet files:
 590 hdfs dfs -get /user/vagrant/warctest/wet/*.warc.wet.gz .
591
592
593f. Now you can view the contents of the WET files to confirm they are what we want:
594 gunzip crawl.warc.wet.gz
595 zless crawl.warc.wet
596
597The wet file contents should look good now: the web pages as WET records without html tags.
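
As an additional quick check, each extracted page should appear as a "conversion" record in the WET file, so counting those (on the already-gunzipped file) gives a rough page count:

    grep -c 'WARC-Type: conversion' crawl.warc.wet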
598
599
600----------------------------------------------------
601I. Setting up Nutch v2 on its own Vagrant VM machine
602----------------------------------------------------
6031. Untar vagrant-for-nutch2.tar.gz
6042. Follow the instructions below. A copy is also in vagrant-for-nutch2/GS_README.txt
605
606---
607REASONING FOR THE NUTCH v2 SPECIFIC VAGRANT VM:
608---
609We were able to get nutch v1 working on a regular machine.
610
611From a few pages online starting with https://stackoverflow.com/questions/33354460/nutch-clone-website, it appeared that "./bin/nutch fetch -all" was the nutch command to mirror a web site. Nutch v2 introduced the -all flag to the ./bin/nutch fetch command. And nutch v2 required HBase which presupposes hadoop.
612
613Our vagrant VM for commoncrawl had a version of HBase that was incompatible with nutch v2, but that HBase version was needed by that VM's versions of hadoop and spark. So Dr Bainbridge came up with the idea of having a separate Vagrant VM for Nutch v2 which would have the version of HBase it needed and a Hadoop version matching that. Compatible versions with nutch 2.3.1 are mentioned at https://waue0920.wordpress.com/2016/08/25/nutch-2-3-1-hbase-0-98-hadoop-2-5-solr-4-10-3/
614(Another option was MongoDB instead of HBase, https://lobster1234.github.io/2017/08/14/search-with-nutch-mongodb-solr/, but that was not even covered in the apache nutch 2 installation guide.)
615
616---
617 Vagrant VM for Nutch2
618---
619This vagrant virtual machine is based on https://github.com/martinprobson/vagrant-hadoop-hive-spark
620
621However:
622- It comes with the older versions of hadoop 2.5.2 and hbase 0.98.21, and no spark or hive or other packages.
623- the VM is called node2 with IP 10.211.55.102 (instead of node1 with IP 10.211.55.101)
624- Since not all packages are installed, fewer ports needed forwarding. And they're forwarded to portnumber+2 to not conflict with any vagrant VM that used the original vagrant image's forwarded port numbers.
625- scripts/common.sh uses HBASE_ARCHIVE of the form "hbase-${HBASE_VERSION}-hadoop2-bin.tar.gz" (the -hadoop2 suffix is specific to v0.98.21)
626- and hbase gets installed as /usr/local/hbase-$HBASE_VERSION-hadoop2 (again with the additional -hadoop2 suffix specific to v0.98.21) in scripts/setup-hbase.sh, so the symbolic link creation there needed to refer to a path of this form.
627
628INSTRUCTIONS:
629a. mostly follow the "Getting Started" instructions at https://github.com/martinprobson/vagrant-hadoop-hive-spark
630b. but after step 3, replace the github cloned Vagrantfile, scripts and resources folders with their modified counterparts included in this zip file.
631c. wherever the rest of that git page refers to "node1", IP "10.211.55.101" and specific port numbers, use instead "node2", IP "10.211.55.102" and the forwarded port numbers in the customised Vagrantfile.
632If a vagrant VM with name node2 / IP "10.211.55.102" is already set up, then adjust all the files in the git-cloned vagrant VM (as already modified by the contents of this vagrant-for-nutch2 folder) as follows:
633- increment all occurrences of node2 and "10.211.55.102" to node3 and IP "10.211.55.103", if not already taken, and
634- in the Vagrantfile increment forwarded ports by another 2 or so from the highest port number values already in use by other vagrant VMs.
635d. After doing "vagrant up --provider=virtualbox" to create the VM, do "vagrant ssh" or "vagrant ssh node<#>" (e.g. "vagrant ssh node2") for step 8 in the "Getting Started" section.
636e. Inside the VM, install emacs, maven, firefox:
637
638 sudo apt-get install emacs
639
640 sudo apt update
641 sudo apt install maven
642
643 sudo apt-get -y install firefox
644
645f. We set up nutch 2.3.1, which can be downloaded from https://archive.apache.org/dist/nutch/2.3.1/, as that version worked as per the nutch2 tutorial instructions with the configuration of specific versions of hadoop, hbase and gora for the vagrant VM described here.
646
647After untarring the nutch 2.3.1 source tarball,
648 1. move apache-nutch-2.3.1/conf/regex-urlfilter.txt to apache-nutch-2.3.1/conf/regex-urlfilter.txt.orig
649 2. download the two files nutch-site.xml and regex-urlfilter.GS_TEMPLATE from http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/hdfs-cc-work/conf
650and put them into the apache-nutch-2.3.1/conf folder (see the command sketch after step 3).
651 3. Then continue following the nutch tutorial 2 instructions at https://cwiki.apache.org/confluence/display/NUTCH/Nutch2Tutorial to set up nutch2 (and apache-solr, if needed, but I didn't install apache-solr for nutch v2).
652 - nutch-site.xml has already been configured to do as much optimisation and speeding up of the crawling as we know about concerning nutch
653 - for each site that will be crawled, regex-urlfilter.GS_TEMPLATE will get copied as the live "regex-urlfilter.txt" file, and lines of regex filters will be appended to its end.
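
A command sketch for steps 1 and 2 above (the raw-file URLs are assumptions based on the svn paths used for the scripts elsewhere in this README, so check them before relying on this):

    cd apache-nutch-2.3.1/conf
    mv regex-urlfilter.txt regex-urlfilter.txt.orig
    wget http://svn.greenstone.org/gs3-extensions/maori-lang-detection/hdfs-cc-work/conf/nutch-site.xml
    wget http://svn.greenstone.org/gs3-extensions/maori-lang-detection/hdfs-cc-work/conf/regex-urlfilter.GS_TEMPLATE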
654
655------------------------------------------------------------------------
656J. Automated crawling with Nutch v2.3.1 and post-processing
657------------------------------------------------------------------------
6581. When you're ready to start crawling with Nutch 2.3.1,
659- copy the batchcrawl.sh file (from http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/hdfs-cc-work/scripts) into the vagrant machine at top level. Make the script executable.
660- copy the to_crawl.tar.gz file containing the "to_crawl" folder (generated by CCWETProcessor.java running over the common-crawl downloaded data where MRI was the primary language) and put it into the vagrant machine at top level.
661- run batchcrawl.sh on a site or range of sites not yet crawled, e.g.
662 ./batchcrawl.sh 00485-00500
663
6642. When crawling is done, the above will have generated the "crawled" folder containing a subfolder for each of the crawled sites, e.g. subfolders 00485 to 00500. Each crawled site folder will contain a dump.txt with the text output of the site's web pages. The "crawled" folder with site subfolders each containing a dump.txt file can be processed with NutchTextDumpProcessor.java.
665
666
667
668-----------------------EOF------------------------
669