source: other-projects/maori-lang-detection/hdfs-cc-work/GS_README.TXT@ 33825

Last change on this file since 33825 was 33825, checked in by ak19, 4 years ago
Beginnings of first draft of write up.
File size: 52.6 KB

1	----------------------------------------
2	INDEX: follow in sequence
3	----------------------------------------
4	A. VAGRANT VM WITH HADOOP AND SPARK
5	B. Create IAM role on Amazon AWS to use S3a
6	C. Configure Spark on your vagrant VM with the AWS authentication details
7	---
8	Script scripts/setup.sh now is automated to do the steps in D-F below
9	and prints out the main instruction for G.
10	---
11	D. OPTIONAL? Further configuration for Hadoop to work with Amazon AWS
12	E. Setup cc-index-table git project
13	F. Setup warc-to-wet tools (git projects)
14	G. Getting and running our scripts
15	---
H. Autistici's crawl - CLI to download web sites as WARCs, features basics to avoid crawler traps
17	I. Setting up Nutch v2 on its own Vagrant VM machine
18	J. Automated crawling with Nutch v2.3.1 and post-processing
19	K. Sending the crawled data into mongodb with NutchTextDumpProcessor.java
20	---
21
22	APPENDIX: Legend of mongodb-data folder's contents
23	APPENDIX: Reading data from hbase tables and backing up hbase
24
25	----------------------------------------
26
27	----------------------------------------
28	A. VAGRANT VM WITH HADOOP AND SPARK
29	----------------------------------------
30	Set up vagrant with hadoop and spark as follows
31
32	1. by following the instructions at
33	https://github.com/martinprobson/vagrant-hadoop-hive-spark
34
35	This will eventually create the following folder, which will contain Vagrantfile
36	/home/<USER>/vagrant-hadoop-hive-spark
37
2. If other vagrant VMs have been set up on the same machine following the same instructions, change the forwarded host ports (the 2nd column of ports) in the file "Vagrantfile" so that they don't clash. In the example below, excerpted from my Vagrantfile, I've incremented the forwarded ports by 1:
39
40	config.vm.network "forwarded_port", guest: 8080, host: 8081
41	config.vm.network "forwarded_port", guest: 8088, host: 8089
42	config.vm.network "forwarded_port", guest: 9083, host: 9084
43	config.vm.network "forwarded_port", guest: 4040, host: 4041
44	config.vm.network "forwarded_port", guest: 18888, host: 18889
45	config.vm.network "forwarded_port", guest: 16010, host: 16011
46
Remember to use the adjusted host ports (e.g. 8081 rather than 8080) when visiting the running VM's web pages from the host machine.
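To avoid hand-editing each port when setting up yet another VM, the forwarded-port lines can be generated. This is only a sketch: the function name print_forwarded_ports is made up for illustration, and the guest port list is copied from the excerpt above.

```shell
#!/usr/bin/env bash
# Guest ports taken from the Vagrantfile excerpt above.
GUEST_PORTS="8080 8088 9083 4040 18888 16010"

# print_forwarded_ports OFFSET   (hypothetical helper)
# Prints one Vagrantfile line per guest port, with host = guest + OFFSET.
print_forwarded_ports() {
    local offset="$1" port
    for port in $GUEST_PORTS; do
        printf 'config.vm.network "forwarded_port", guest: %d, host: %d\n' \
            "$port" "$((port + offset))"
    done
}

# Example: print_forwarded_ports 1
```

Paste the printed lines over the existing forwarded_port lines in the Vagrantfile, using a different offset for each VM on the machine.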
48
49	3. The most useful vagrant commands:
50	vagrant up # start up the vagrant VM if not already running.
51	# May need to provide VM's ID if there's more than one vagrant VM
vagrant ssh # ssh into the sole vagrant VM, else may need to provide vagrant VM's ID
53
54	vagrant halt # to shutdown the vagrant VM. Provide VM's ID if there's more than one vagrant VM.
55
56	(vagrant destroy) # to get rid of your vagrant VM. Useful if you've edited your Vagrantfile
57
58
59	4. Inside the VM, /home/<USER>/vagrant-hadoop-hive-spark will be shared and mounted as /vagrant
60	Remember, this is the folder containing Vagrantfile. It's easy to use the shared folder to transfer files between the VM and the actual machine that hosts it.
61
62	5. Install EMACS, FIREFOX AND MAVEN on the vagrant VM:
Start up vagrant machine ("vagrant up") and ssh into it ("vagrant ssh") if you haven't already.
64
65
66	a. sudo apt-get -y install firefox
67
68	b. sudo apt-get install emacs
69
70	c. sudo apt-get install maven
71	(or sudo apt update
72	sudo apt install maven)
73
74	Maven is needed for the commoncrawl github projects we'll be working with.
75
76
77	6. Although you can edit the Vagrantfile to have emacs and maven automatically installed when the vagrant VM is created, for firefox, you're advised to install it as above.
78
79	To be able to view firefox from the machine hosting the VM, use a separate terminal and run:
80	vagrant ssh -- -Y
81	[or "vagrant ssh -- -Y node1", if VM ID is node1]
82
83	READING ON Vagrant:
84	* Guide: https://www.vagrantup.com/intro/getting-started/index.html
85	* Common cmds: https://blog.ipswitch.com/5-vagrant-commands-you-need-to-know
86	* vagrant reload = vagrant halt + vagrant up https://www.vagrantup.com/docs/cli/reload.html
87	* https://stackoverflow.com/questions/46903623/how-to-use-firefox-ui-in-vagrant-box
88	* https://stackoverflow.com/questions/22651399/how-to-install-firefox-in-precise64-vagrant-box
89	sudo apt-get -y install firefox
90	* vagrant install emacs: https://medium.com/@AnnaJS15/getting-started-with-virtualbox-and-vagrant-8d98aa271d2a
91
92	* hadoop conf: sudo vi /usr/local/hadoop-2.7.6/etc/hadoop/mapred-site.xml
93	* https://data-flair.training/forums/topic/mkdir-cannot-create-directory-data-name-node-is-in-safe-mode/
94
95	-------------------------------------------------
96	B. Create IAM role on Amazon AWS to use S3 (S3a)
97	-------------------------------------------------
CommonCrawl (CC) crawl data is stored on Amazon S3. We access it through s3a, the newest Hadoop connector for S3, which has superseded both the original s3 connector and its earlier successor s3n.

In order to have access to CC crawl data, you need to create an IAM role on Dr Bainbridge's Amazon AWS account and configure its profile for commoncrawl.
101
102	1. Log into Dr Bainbridge's Amazon AWS account
103	- In the aws management console:
[account email address: obscured in this copy of the file]
105	lab pwd, capital R and ! (maybe g)
106
107
108	2. Create a new "iam" role or user for "commoncrawl(er)" profile
109
110	3. You can create the commoncrawl profile while creating the user/role, by following the instructions at https://answers.dataiku.com/1734/common-crawl-s3
111	which states
112
113	"Even though the bucket is public, if your AWS key does not have your full permissions (ie if it's a restricted IAM user), you need to grant explicit access to the commoncrawl bucket: attach the following policy to your IAM user"
114
115	#### START POLICY IN JSON FORMAT ###
116	{
117	"Version": "2012-10-17",
118	"Statement": [
119	{
120	"Sid": "Stmt1503647467000",
121	"Effect": "Allow",
122	"Action": [
123	"s3:GetObject",
124	"s3:ListBucket"
125	],
126	"Resource": [
127	"arn:aws:s3:::commoncrawl/*",
128	"arn:aws:s3:::commoncrawl"
129	]
130	}
131	]
132	}
133	#### END POLICY ###
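If you prefer the command line to the AWS web console, the policy can be saved to a file and attached with the AWS CLI. This is a sketch under assumptions: the function names and the policy name "commoncrawl-read" are made up, and attach_cc_policy assumes the AWS CLI is installed and configured with credentials that are allowed to edit IAM. I did this step through the web console myself.

```shell
#!/usr/bin/env bash
# write_cc_policy FILE   (hypothetical helper)
# Writes the read-only commoncrawl bucket policy shown above to FILE.
write_cc_policy() {
    cat > "$1" <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1503647467000",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::commoncrawl/*",
        "arn:aws:s3:::commoncrawl"
      ]
    }
  ]
}
EOF
}

# attach_cc_policy USER FILE   (hypothetical helper, not called here)
# Assumption: "aws" is the configured AWS CLI; USER is your IAM user name.
attach_cc_policy() {
    aws iam put-user-policy \
        --user-name "$1" \
        --policy-name commoncrawl-read \
        --policy-document "file://$2"
}
```

Usage would be along the lines of: write_cc_policy cc-policy.json, then attach_cc_policy <your-iam-user> cc-policy.json.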
134
135
136	--------------------------------------------------------------------------
137	C. Configure Spark on your vagrant VM with the AWS authentication details
138	--------------------------------------------------------------------------
Any Spark jobs run against the CommonCrawl data stored on Amazon S3 need to be able to authenticate with the AWS IAM role you created above. To do this, put the Amazon AWS access key and secret key in the Spark configuration properties file. (Don't configure these values in hadoop's core-site.xml instead: in that case, the authentication details don't get copied across when distributed jobs are run on the other computers in the cluster, which also need to know how to authenticate.) The steps are:
140
141	1. Inside the vagrant vm:
142
143	sudo emacs /usr/local/spark-2.3.0-bin-hadoop2.7/conf/spark-defaults.conf
144	(sudo emacs $SPARK_HOME/conf/spark-defaults.conf)
145
146	2. Edit the spark properties conf file to contain these 3 new properties:
147
148	spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
149	spark.hadoop.fs.s3a.access.key=PASTE_IAM-ROLE_ACCESSKEY_HERE
150	spark.hadoop.fs.s3a.secret.key=PASTE_IAM-ROLE_SECRETKEY_HERE
151
152	Instructions on which properties to set were taken from:
153	- https://stackoverflow.com/questions/30385981/how-to-access-s3a-files-from-apache-spark
154	- https://community.cloudera.com/t5/Community-Articles/HDP-2-4-0-and-Spark-1-6-0-connecting-to-AWS-S3-buckets/ta-p/245760
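The edit in step 2 can also be scripted so that re-running it doesn't pile up duplicate lines. A minimal sketch: set_spark_prop is a made-up helper, and reading the keys from the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY is my assumption, chosen so the keys don't end up hardcoded in a script.

```shell
#!/usr/bin/env bash
# set_spark_prop FILE KEY VALUE   (hypothetical helper)
# Removes any existing "KEY=..." or "KEY ..." line from FILE, then appends
# "KEY=VALUE", so running it twice leaves exactly one definition.
set_spark_prop() {
    local file="$1" key="$2" value="$3" tmp
    touch "$file"
    tmp="$(mktemp)"
    # Keep every line whose first field (split on "=", space or tab) is not KEY.
    awk -v k="$key" '{ split($0, a, /[= \t]/); if (a[1] != k) print }' "$file" > "$tmp"
    printf '%s=%s\n' "$key" "$value" >> "$tmp"
    cat "$tmp" > "$file"   # overwrite in place, keeping the file's permissions
    rm -f "$tmp"
}

# Example (run with sufficient rights to write the conf file):
# CONF="$SPARK_HOME/conf/spark-defaults.conf"
# set_spark_prop "$CONF" spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
# set_spark_prop "$CONF" spark.hadoop.fs.s3a.access.key "$AWS_ACCESS_KEY_ID"
# set_spark_prop "$CONF" spark.hadoop.fs.s3a.secret.key "$AWS_SECRET_ACCESS_KEY"
```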
155
156	[NOTE, inactive alternative: Instead of editing spark's config file to set these properties, these properties can also be set in the bash script that executes the commoncrawl Spark jobs:
157
158	$SPARK_HOME/bin/spark-submit \
159	...
160	--conf spark.hadoop.fs.s3a.access.key=ACCESSKEY \
161	--conf spark.hadoop.fs.s3a.secret.key=SECRETKEY \
162	--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
163	...
164
165	But better not to hardcode authentication details into code, so I did it the first way.
166	]
167
168	----------------------------------------------------------------------
169	NOTE:
170	Script scripts/setup.sh now is automated to do the steps in D-F below
171	and prints out the main instruction for G.
172
173
174	----------------------------------------------------------------------
175	D. OPTIONAL? Further configuration for Hadoop to work with Amazon AWS
176	----------------------------------------------------------------------
177	The following 2 pages state that additional steps are necessary to get hadoop and spark to work with AWS S3a:
178
179	- https://stackoverflow.com/questions/30385981/how-to-access-s3a-files-from-apache-spark
180	- https://community.cloudera.com/t5/Community-Articles/HDP-2-4-0-and-Spark-1-6-0-connecting-to-AWS-S3-buckets/ta-p/245760
181
182	I'm not sure whether these steps were really necessary in my case, and if so, whether it was A or B below that got things working for me. However, I have both A and B below set up.
183
184
185	A. Check your maven installation for necessary jars:
186
187	1. Installing maven may already have got the specifically recommended version of AWS-Java-SDK (aws-java-sdk-1.7.4.jar) and v2.7.6 hadoop-aws matching the vagrant VM's hadoop version (hadoop-aws-2.7.6.jar). Check these locations, as that's where I have them:
188	- /home/vagrant/.m2/repository/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar
189	- /home/vagrant/.m2/repository/org/apache/hadoop/hadoop-aws/2.7.6/hadoop-aws-2.7.6.jar
190
191	The specifically recommended v.1.7.4 from the instructions can be found off https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk/1.7.4 at https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar
192
193	2. The script that runs the 2 Spark jobs uses the above paths for one of the spark jobs:
194	$SPARK_HOME/bin/spark-submit \
195	--jars file:/home/vagrant/.m2/repository/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar,file:/home/vagrant/.m2/repository/org/apache/hadoop/hadoop-aws/2.7.6/hadoop-aws-2.7.6.jar \
196	--driver-class-path=/home/vagrant/.m2/repository/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar:/home/vagrant/.m2/repository/org/apache/hadoop/hadoop-aws/2.7.6/hadoop-aws-2.7.6.jar \
197
However, the other Spark job in the script does not set --jars or --driver-class-path, despite also referring to the s3a://commoncrawl table. So I'm not sure whether the jars are necessary or whether they were just being ignored when provided.
199
200	B. Download jar files and put them on the hadoop classpath:
201
202	1. download the jar files:
203	- I obtained aws-java-sdk-1.11.616.jar (v1.11) from https://aws.amazon.com/sdk-for-java/
204
205	- I downloaded hadoop-aws 2.7.6 jar, as it goes with my version of hadoop, from https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/2.7.6
206
207	2. The easiest solution is to copy the 2 downloaded jars onto a location in the hadoop classpath.
208
209	a. The command that shows the paths present on the Hadoop CLASSPATH:
210	hadoop classpath
211	One of the paths this will list is /usr/local/hadoop-2.7.6/share/hadoop/common/
212
213	b. SUDO COPY the 2 downloaded jar files, hadoop-aws-2.7.6.jar and aws-java-sdk-1.11.616.jar, to this location:
214
215	sudo cp hadoop-aws-2.7.6.jar /usr/local/hadoop-2.7.6/share/hadoop/common/.
216	sudo cp aws-java-sdk-1.11.616.jar /usr/local/hadoop-2.7.6/share/hadoop/common/.
217
218	Any hadoop jobs run will now find these 2 jar files on the classpath.
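To double-check that the copied jars really are reachable, the output of "hadoop classpath" can be searched. Note that the classpath mostly lists wildcard entries like /some/dir/* rather than individual jars, so a plain grep for the jar name won't find it. The function below is a made-up helper sketching one way to do the lookup.

```shell
#!/usr/bin/env bash
# find_jar_on_classpath CLASSPATH NAME_PREFIX   (hypothetical helper)
# Prints every jar whose file name starts with NAME_PREFIX and that is
# reachable through CLASSPATH; returns non-zero if none is found.
find_jar_on_classpath() {
    local classpath="$1" prefix="$2" entry jar found=1
    local -a entries
    # read does not glob-expand, so wildcard entries stay intact here.
    IFS=':' read -r -a entries <<< "$classpath"
    for entry in "${entries[@]}"; do
        case "$entry" in
            */\*)   # wildcard entry such as /some/dir/*: look at the jars inside
                for jar in "${entry%\*}$prefix"*.jar; do
                    if [ -f "$jar" ]; then echo "$jar"; found=0; fi
                done ;;
            *.jar)  # a jar listed explicitly
                case "${entry##*/}" in
                    "$prefix"*)
                        if [ -f "$entry" ]; then echo "$entry"; found=0; fi ;;
                esac ;;
        esac
    done
    return "$found"
}

# Example: find_jar_on_classpath "$(hadoop classpath)" hadoop-aws
```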
219
220	[NOTE, unused alternative: Instead of copying the 2 jar files into a system location, assuming they were downloaded into /home/vagrant/lib, you can also export a custom folder's jar files into the hadoop classpath from the bash script that runs the spark jobs. This had no effect for me, and was commented out, and is another reason why I'm not sure if the 2 jar files were even necessary.
221	#export LIBJARS=/home/vagrant/lib/*
#export HADOOP_CLASSPATH=`echo ${LIBJARS} | sed s/,/:/g`
223	]
224
225
226	------------------------------------
227	E. Setup cc-index-table git project
228	------------------------------------
229	Need to be inside the vagrant VM.
230
231	1. Since you should have already installed maven, you can checkout and compile the cc-index-table git project.
232
233	git clone https://github.com/commoncrawl/cc-index-table.git
234
235	2. Modify the toplevel pom.xml file used by maven by changing the spark version used to 2.3.0 and adding a dependency for hadoop-aws 2.7.6, as indicated below:
236
237	17c17,18
238	< <spark.version>2.4.1</spark.version>
239	---
240	> <!--<spark.version>2.4.1</spark.version>-->
241	> <spark.version>2.3.0</spark.version>
242	135a137,143
243	> <dependency>
244	> <groupId>org.apache.hadoop</groupId>
245	> <artifactId>hadoop-aws</artifactId>
246	> <version>2.7.6</version>
247	> </dependency>
248	>
249
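The spark version half of the above diff is a one-line substitution and can be scripted. A sketch only: set_spark_version is a made-up helper, and the hadoop-aws dependency block still has to be added to pom.xml by hand as shown in the diff.

```shell
#!/usr/bin/env bash
# set_spark_version POM OLD NEW   (hypothetical helper)
# Rewrites <spark.version>OLD</spark.version> to NEW in POM and keeps the
# untouched file as POM.orig. The "." in the versions acts as a regex
# wildcard in sed, which is harmless for this one-off edit.
set_spark_version() {
    local pom="$1" old="$2" new="$3"
    cp "$pom" "$pom.orig"
    sed "s|<spark.version>$old</spark.version>|<spark.version>$new</spark.version>|" \
        "$pom.orig" > "$pom"
}

# Example, from inside cc-index-table: set_spark_version pom.xml 2.4.1 2.3.0
```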
250	3. Although cc-index-table will compile successfully after the above modifications, it will nevertheless throw an exception when it's eventually run. To fix that, edit the file "cc-index-table/src/main/java/org/commoncrawl/spark/examples/CCIndexWarcExport.java" as follows:
251
252	a. Set option(header) to false, since the csv file contains no header row, only data rows.
253	Change:
254	sqlDF = sparkSession.read().format("csv").option("header", true).option("inferSchema", true)
255	.load(csvQueryResult);
256	To
257	sqlDF = sparkSession.read().format("csv").option("header", false).option("inferSchema", true)
258	.load(csvQueryResult);
259
260	b. The 4 column names are inferred as _c0 to _c3, not as url/warc_filename etc.
261	Comment out:
262	//JavaRDD<Row> rdd = sqlDF.select("url", "warc_filename", "warc_record_offset", "warc_record_length").rdd()
263	.toJavaRDD();
264	Replace with the default inferred column names:
265	JavaRDD<Row> rdd = sqlDF.select("_c0", "_c1", "_c2", "_c3").rdd()
266	.toJavaRDD();
267
268	// TODO: link svn committed versions of orig and modified CCIndexWarcExport.java here.
269
270	4. Now (re)compile cc-index-table with the above modifications:
271
272	cd cc-index-table
273	mvn package
274
275	-------------------------------
276	F. Setup warc-to-wet tools
277	-------------------------------
To convert WARC files to WET (.warc.wet) files, a couple more tools need to be checked out, set up and compiled. These instructions are derived from those at https://groups.google.com/forum/#!topic/common-crawl/hsb90GHq6to
279
280	1. Grab and compile the 2 git projects for converting warc to wet:
281	git clone https://github.com/commoncrawl/ia-web-commons
282	cd ia-web-commons
mvn install
cd ..
284
285	git clone https://github.com/commoncrawl/ia-hadoop-tools
286	cd ia-hadoop-tools
287	# can't compile this yet
288
289
290	2. Add the following into ia-hadoop-tools/pom.xml, in the top of the <dependencies> element, so that maven finds an appropriate version of the org.json package and its JSONTokener (version number found off https://mvnrepository.com/artifact/org.json/json):
291
292	<dependency>
293	<groupId>org.json</groupId>
294	<artifactId>json</artifactId>
295	<version>20131018</version>
296	</dependency>
297
298	[
UNFAMILIAR CHANGES that I don't recollect making and that may have been a consequence of the change in step 1 above:
300	a. These further differences show up between the original version of the file in pom.xml.orig and the modified new pom.xml:
301	ia-hadoop-tools>diff pom.xml.orig pom.xml
302
303	< <groupId>org.netpreserve.commons</groupId>
304	< <artifactId>webarchive-commons</artifactId>
305	< <version>1.1.1-SNAPSHOT</version>
306	---
307	> <groupId>org.commoncrawl</groupId>
308	> <artifactId>ia-web-commons</artifactId>
309	> <version>1.1.9-SNAPSHOT</version>
310
311	b. I don't recollect changing or manually introducing any java files. I just followed the instructions at https://groups.google.com/forum/#!topic/common-crawl/hsb90GHq6to
312
313	However, a diff -rq between the latest "ia-hadoop-tools" gitproject checked out a month after the "ia-hadoop-tools.orig" checkout I ran, shows the following differences in files which are not shown as recently modified in github itself in that same period.
314
315	ia-hadoop-tools> diff -rq ia-hadoop-tools ia-hadoop-tools.orig/
316	Files ia-hadoop-tools/src/main/java/org/archive/hadoop/io/HDFSTouch.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/io/HDFSTouch.java differ
317	Only in ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs: ArchiveFileExtractor.java
318	Files ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/CDXGenerator.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs/CDXGenerator.java differ
319	Files ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/JobDriver.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs/JobDriver.java differ
320	Files ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/WARCMetadataRecordGenerator.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs/WARCMetadataRecordGenerator.java differ
321	Files ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/WATGenerator.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs/WATGenerator.java differ
322	Only in ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs: WEATGenerator.java
323	Files ia-hadoop-tools/src/main/java/org/archive/hadoop/mapreduce/ZipNumPartitioner.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/mapreduce/ZipNumPartitioner.java differ
324	Files ia-hadoop-tools/src/main/java/org/archive/hadoop/pig/ZipNumRecordReader.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/pig/ZipNumRecordReader.java differ
325	Files ia-hadoop-tools/src/main/java/org/archive/hadoop/streaming/ZipNumRecordReader.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/streaming/ZipNumRecordReader.java differ
326	Files ia-hadoop-tools/src/main/java/org/archive/server/FileBackedInputStream.java and ia-hadoop-tools.orig/src/main/java/org/archive/server/FileBackedInputStream.java differ
327	Files ia-hadoop-tools/src/main/java/org/archive/server/GZRangeClient.java and ia-hadoop-tools.orig/src/main/java/org/archive/server/GZRangeClient.java differ
328	Files ia-hadoop-tools/src/main/java/org/archive/server/GZRangeServer.java and ia-hadoop-tools.orig/src/main/java/org/archive/server/GZRangeServer.java differ
329	]
330
331	3. Now can compile ia-hadoop-tools:
332	cd ia-hadoop-tools
333	mvn package
334
335	4. Can't run it until guava.jar is on hadoop classpath. Locate a guava.jar and put it into an existing location checked for by hadoop classpath:
336
337	locate guava.jar
338	# found in /usr/share/java/guava.jar and /usr/share/maven/lib/guava.jar
339	diff /usr/share/java/guava.jar /usr/share/maven/lib/guava.jar
340	# identical/no difference, so can use either
341	sudo cp /usr/share/java/guava.jar /usr/local/hadoop/share/hadoop/common/.
342	# now guava.jar has been copied into a location on hadoop classpath
343
344
345	Having done the above, our bash script will now be able to convert WARC to WET files when it runs:
346	$HADOOP_MAPRED_HOME/bin/hadoop jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -strictMode -skipExisting batch-id-xyz hdfs:///user/vagrant/PATH/TO/warc/*.warc.gz
347	Our script expects a specific folder structure: there should be a "warc" folder (containing the warc files), which is supplied as above, but also an empty "wet" and "wat" folder at the same level as the "warc" folder.
348
349
While the job is running, you can visit the Spark Context at http://node1:4040/jobs/ (The first time, it was at http://node1:4041/jobs/ for me, since I forwarded the vagrant VM's ports at +1. On subsequent runs, however, it appeared to be on node1:4040/jobs.)
351
352	-----------------------------------
353	G. Getting and running our scripts
354	-----------------------------------
355
356	1. Grab our 1st bash script and put it into the /home/vagrant/cc-index-table/src/script:
357	cd cc-index-table/src/script
358	wget http://svn.greenstone.org/gs3-extensions/maori-lang-detection/hdfs-cc-work/scripts/get_maori_WET_records_for_crawl.sh
359	chmod u+x get_maori_WET_records_for_crawl.sh
360
361	RUN AS:
362	cd cc-index-table
363	./src/script/get_maori_WET_records_for_crawl.sh <crawl-timestamp>
where crawl-timestamp is of the form "CC-MAIN-YYYY-##", from September 2018 onwards
365
366	OUTPUT:
367	After hours of processing (leave it to run overnight), you should end up with:
368	hdfs dfs -ls /user/vagrant/<crawl-timestamp>
369	In particular, the zipped wet records at hdfs:///user/vagrant/<crawl-timestamp>/wet/
370	that we want would have been copied into /vagrant/<crawl-timestamp>-wet-files/
371
372
373	The script get_maori_WET_records_for_crawl.sh
- takes a crawl timestamp of the form "CC-MAIN-YYYY-##" from Sep 2018 onwards (before which content_languages were not indexed). The legitimate crawl timestamps are listed in the first column at http://index.commoncrawl.org/
375	- runs a spark job against CC's AWS bucket over s3a to create a csv table of MRI language records
376	- runs a spark job to download all the WARC records from CC's AWS that are denoted by the csv file's records into zipped warc files
377	- converts WARC to WET: locally converts the downloaded warc.gz files into warc.wet.gz (and warc.wat.gz) files
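Since a run takes hours, it can be worth checking the crawl timestamp argument before starting. The check below is a sketch and not part of our script: is_usable_crawl_id is a made-up name, and the default minimum CC-MAIN-2018-39 is my assumption for the September 2018 crawl, so confirm it against http://index.commoncrawl.org/ first.

```shell
#!/usr/bin/env bash
# is_usable_crawl_id ID [MIN_ID]   (hypothetical helper)
# Succeeds if ID looks like CC-MAIN-YYYY-## and is not older than MIN_ID.
# Assumption: MIN_ID defaults to CC-MAIN-2018-39 (taken to be Sep 2018).
is_usable_crawl_id() {
    local id="$1" min="${2:-CC-MAIN-2018-39}"
    [[ "$id" =~ ^CC-MAIN-[0-9]{4}-[0-9]{2}$ ]] || return 1
    # IDs are fixed-width, so plain string order is chronological order.
    [[ ! "$id" < "$min" ]]
}

# Example:
# is_usable_crawl_id "$1" || { echo "Bad crawl timestamp: $1" >&2; exit 1; }
```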
378
379
2. Grab our 2nd bash script and put it into the top level of cc-index-table (/home/vagrant/cc-index-table):
381
382	cd cc-index-table
383	wget http://svn.greenstone.org/gs3-extensions/maori-lang-detection/hdfs-cc-work/scripts/get_Maori_WET_records_from_CCSep2018_on.sh
384	chmod u+x get_Maori_WET_records_from_CCSep2018_on.sh
385
386	RUN FROM cc-index-table DIRECTORY AS:
387	(cd cc-index-table)
388	./get_Maori_WET_records_from_CCSep2018_on.sh
389
390	This script just runs the 1st script cc-index-table/src/script/get_maori_WET_records_for_crawl.sh (above) to process all listed common-crawls since September 2018.
391	If any fails, then the script will terminate. Else it runs against each common-crawl in sequence.
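The run-in-sequence, stop-at-first-failure behaviour described here boils down to a loop like the following. This is an illustrative sketch, not the contents of our script; run_for_each_crawl is a made-up name.

```shell
#!/usr/bin/env bash
# run_for_each_crawl COMMAND ID...   (hypothetical helper)
# Runs COMMAND once per crawl ID, in order, and stops at the first failure.
run_for_each_crawl() {
    local cmd="$1" id
    shift
    for id in "$@"; do
        echo "Processing $id"
        if ! "$cmd" "$id"; then
            echo "Failed on $id, stopping" >&2
            return 1
        fi
    done
}

# Example, from inside cc-index-table, with crawl IDs of your choosing:
# run_for_each_crawl ./src/script/get_maori_WET_records_for_crawl.sh <id1> <id2>
```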
392
393	NOTE: If needed, update the script with more recent crawl timestamps from http://index.commoncrawl.org/
394
395	OUTPUT:
396	After days of running, will end up with:
397	hdfs:///user/vagrant/<crawl-timestamp>/wet/
398	for each crawl-timestamp listed in the script,
399	which at present would have got copied into
400	/vagrant/<crawl-timestamp>-wet-files/
401
402	Each of these output wet folders can then be processed in turn by CCWETProcessor.java from http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java
403
404	-----------------------------------
H. Autistici's crawl
-----------------------------------
Autistici's crawl: CLI to download web sites as WARCs, features basics to avoid crawler traps.

Of the several software options for site mirroring, Autistici's "crawl" seemed promising:
410	https://anarc.at/services/archive/web/
411
412	- CLI.
413	- Can download a website quite simply, though flags for additional settings are available.
414	- Coded to prevent common traps.
415	- Downloads website as WARC file
416	- Now I have the WARC to WET process working for the WARC file it produced for the usual test site (Dr Bainbridge's home page)
417
418	Need to have Go installed in order to install and run Autistici's crawl.
419	Not a problem, because I can do it on the remote machine (which also hosts the hdfs) where I have sudo powers.
420
421	INSTRUCTIONS
422
423	1. Install go 1.11 by following instructions at https://medium.com/better-programming/install-go-1-11-on-ubuntu-18-04-16-04-lts-8c098c503c5f
424	2. Create go environment:
425	#!/bin/bash
426	# environment vars for golang
427	export GOROOT=/usr/local/go
428	export GOPATH=$HOME/go
429	export PATH=$GOPATH/bin:$GOROOT/bin:$PATH
430	3. The https://git.autistici.org/ale/crawl/README.md instructions on installing are not very clear and don't work as is at this stage.
431
432	These steps work:
433
434	cd $GOPATH
435	mkdir bin
436	mkdir src
437	cd src
438
4. Trying to "go install" the crawl URL directly didn't work (see the links below), so clone the project by hand into the matching location under $GOPATH/src:
440	https://stackoverflow.com/questions/14416275/error-cant-load-package-package-my-prog-found-packages-my-prog-and-main
441	[https://stackoverflow.com/questions/26694271/go-install-doesnt-create-any-bin-file]
442
443	vagrant@node2:~/go/src$
444	mkdir -p git.autistici.org/ale
445	cd git.autistici.org/ale
446	git clone https://git.autistici.org/ale/crawl.git
447
448	[Now can run the install command in README.md:]
449	cd $GOPATH/src
450	go install git.autistici.org/ale/crawl/cmd/crawl
451
452	Now we should have a $GOPATH/bin folder containing the "crawl" binary
453
454	5. Run a crawl:
455	cd $GOPATH/bin
456	./crawl https://www.cs.waikato.ac.nz/~davidb/
457
458	which downloads the site and puts the warc file into the $GOPATH/bin folder.
459
More options are all described in the instructions in README.md, including the output folder and a WARC filename pattern for huge sites, so that the multiple warc files created for one site follow the same pattern.
461
462	6. To view the RAW contents of a WARC file:
463	https://github.com/ArchiveTeam/grab-site/blob/master/README.md#viewing-the-content-in-your-warc-archives
464
465	zless <warc-file-name>
466
zless is already installed on the vagrant VM
468
469
470	-----------------------------------------------------------------------------------------------
How to run warc-to-wet conversion on sites downloaded as WARCs by Autistici's "crawl"
472	-----------------------------------------------------------------------------------------------
473	ISSUES CONVERTING WARC to WET:
474	---
475	WARC files produced by Autistici crawl are of a somewhat different format to CommonCrawl WARCs.
476	- missing elements in header
477	- different header elements
478	- ordering different (if that matters)
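A quick way to see such header differences for yourself is to print the header block of the first record of each WARC and compare them. A small sketch: warc_first_headers is a made-up name, and cc-sample.warc.gz in the example stands for any CommonCrawl WARC you have at hand.

```shell
#!/usr/bin/env bash
# warc_first_headers FILE.warc.gz   (hypothetical helper)
# Prints the header lines of the first WARC record (everything up to the
# first blank line), with trailing carriage returns removed.
warc_first_headers() {
    gzip -dc "$1" 2>/dev/null | awk '{ sub(/\r$/, "") } /^$/ { exit } { print }'
}

# Example: compare just the header names of the two kinds of WARC
# diff <(warc_first_headers crawl.warc.gz | cut -d: -f1 | sort) \
#      <(warc_first_headers cc-sample.warc.gz | cut -d: -f1 | sort)
```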
479
480	But WET is an official format, not CommonCrawl specific, as indicated by
481
482	https://library.stanford.edu/projects/web-archiving/research-resources/data-formats-and-apis
483	"WET (parsed text)
484
485	WARC Encapsulated Text (WET) or parsed text consists of the extracted plaintext, delimited by archived document. Each record retains the associated URL and timestamp. Common Crawl provides details on the format and Internet Archive provides documentation on usage, though they use different names for the format."
486
So it must be possible to get the WARC to WET conversion used for CommonCrawl data to work on Autistici crawl's WARC files.
488
489
490	RESOLUTION:
491	---
492	I made changes to 2 java source files in the 2 github projects ia-web-commons and ia-hadoop-tools, which we use for the WARC to WET processing of CommonCrawl data. These gitprojects (with modifications for commoncrawl) are already on http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/hdfs-cc-work/gitprojects.
493
494	The changed files are as follows:
495	1. patches/WATExtractorOutput.java
496	put into ia-web-commons/src/main/java/org/archive/extract
497	after renaming existing to .orig
498
499	THEN RECOMPILE ia-web-commons with:
500	mvn install
501
502	2. patches/GZRangeClient.java
503	put into ia-hadoop-tools/src/main/java/org/archive/server
504	after renaming existing to .orig
505
506	THEN RECOMPILE ia-hadoop-tools with:
507	mvn package
508
509	Make sure to first compile ia-web-commons, then ia-hadoop-tools.
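The "rename existing to .orig, then drop in the patched file" routine is the same for both files, so it can be wrapped in a small function. A sketch with a made-up name, install_patch; it deliberately never overwrites an existing .orig, so the pristine copy survives repeated runs.

```shell
#!/usr/bin/env bash
# install_patch PATCH_FILE TARGET_DIR   (hypothetical helper)
# Copies PATCH_FILE (e.g. patches/GZRangeClient.java) into TARGET_DIR, first
# renaming any existing file of that name from NAME.java to NAME.orig.
install_patch() {
    local patch="$1" dir="$2" name
    name="$(basename "$patch")"
    if [ -f "$dir/$name" ] && [ ! -e "$dir/${name%.java}.orig" ]; then
        mv "$dir/$name" "$dir/${name%.java}.orig"
    fi
    cp "$patch" "$dir/$name"
}

# Example:
# install_patch patches/WATExtractorOutput.java ia-web-commons/src/main/java/org/archive/extract
# install_patch patches/GZRangeClient.java ia-hadoop-tools/src/main/java/org/archive/server
```

Afterwards recompile as described above: "mvn install" in ia-web-commons first, then "mvn package" in ia-hadoop-tools.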
510
511
512	The modifications made to the above 2 files are as follows:
513	>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
514	1. ia-web-commons/src/main/java/org/archive/extract/WATExtractorOutput.java
515
516	[diff src/main/java/org/archive/extract/WATExtractorOutput.orig src/main/java/org/archive/extract/WATExtractorOutput.java]
517
518	162,163c162,163
519	< targetURI = extractOrIO(md, "Envelope.WARC-Header-Metadata.WARC-Filename");
520	< } else {
521	---
522	> targetURI = extractOrIO(md, "Envelope.WARC-Header-Metadata.WARC-Warcinfo-ID");
523	> } else {
524
525
526	2. ia-hadoop-tools/src/main/java/org/archive/server/GZRangeClient.java
527
528	[diff src/main/java/org/archive/server/GZRangeClient.orig src/main/java/org/archive/server/GZRangeClient.java]
529
530	76,83c76,82
531	< "WARC/1.0\r\n" +
532	< "WARC-Type: warcinfo\r\n" +
533	< "WARC-Date: %s\r\n" +
534	< "WARC-Filename: %s\r\n" +
535	< "WARC-Record-ID: <urn:uuid:%s>\r\n" +
536	< "Content-Type: application/warc-fields\r\n" +
537	< "Content-Length: %d\r\n\r\n";
538	<
539	---
540	> "WARC/1.0\r\n" +
541	> "Content-Type: application/warc-fields\r\n" +
542	> "WARC-Type: warcinfo\r\n" +
543	> "WARC-Warcinfo-ID: <urn:uuid:%s>\r\n" +
544	> "Content-Length: %d\r\n\r\n" +
545	> "WARC-Record-ID: <urn:uuid:%s>\r\n" +
546	> "WARC-Date: %s\r\n";
547	115,119c114,119
548	< private static String DEFAULT_WARC_PATTERN = "software: %s Extractor\r\n" +
549	< "format: WARC File Format 1.0\r\n" +
550	< "conformsTo: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf\r\n" +
551	< "publisher: Internet Archive\r\n" +
552	< "created: %s\r\n\r\n";
553	---
554	> private static String DEFAULT_WARC_PATTERN = "Software: crawl/1.0\r\n" +
555	> "Format: WARC File Format 1.0\r\n" +
556	> "Conformsto: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf\r\n\r\n";
557	> // +
558	> //"publisher: Internet Archive\r\n" +
559	> //"created: %s\r\n\r\n";
560	<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
561
562
563	3. To run WARC to WET, the warc needs to live on hdfs in a warc folder and there should be wet and wat folders at the same level.
564
565	For example, assume that running Autistici's crawl generated $GOPATH/bin/crawl.warc.gz
566	(default location and filename unless you pass flags to crawl CLI to control these)
567
568	a. Ensure you get crawl.warc.gz onto the vagrant VM that has the git projects for WARC-to-WET installed, recompiled with the above modifications.
569
570	b. Now, create the folder structure needed for warc-to-wet conversion:
571	hdfs dfs -mkdir /user/vagrant/warctest
572	hdfs dfs -mkdir /user/vagrant/warctest/warc
573	hdfs dfs -mkdir /user/vagrant/warctest/wet
574	hdfs dfs -mkdir /user/vagrant/warctest/wat
575
c. Put crawl.warc.gz into the warc folder on hdfs:
577	hdfs dfs -put crawl.warc.gz /user/vagrant/warctest/warc/.
578
579	d. Finally, time to run the actual warc-to-wet conversion from ia-hadoop-tools:
580	cd ia-hadoop-tools
581	WARC_FOLDER=/user/vagrant/warctest/warc
582	$HADOOP_MAPRED_HOME/bin/hadoop jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -strictMode -skipExisting batch-id-xyz $WARC_FOLDER/crawl*.warc.gz
583
584	More meaningful when the WARC_FOLDER contains multiple *.warc.gz files,
585	as the above will use map-reduce to generate a .warc.wet.gz file in the output wet folder for each input .warc.gz file.
586
587	e. Copy the generated wet files across from /user/vagrant/warctest/wet/:
588
589	(cd /vagrant or else
590	cd /home/vagrant
591	)
592	hdfs dfs -get /user/vagrant/warctest/wet/crawl.warc.wet.gz .
593
594	or, when dealing with multiple input warc files, we'll have multiple wet files:
595	hdfs dfs -get /user/vagrant/warctest/wet/*.warc.wet.gz
596
597
598	f. Now can view the contents of the WET files to confirm they are what we want:
599	gunzip crawl.warc.wet.gz
600	zless crawl.warc.wet
601
602	The wet file contents should look good now: the web pages as WET records without html tags.
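Steps b to e can be collected into one place. The sketch below only prints the commands so they can be checked before anything touches hdfs. The function name warc_to_wet_commands is made up, and I use "hdfs dfs -mkdir -p" here (instead of the separate mkdir of the parent folder above) so that re-runs don't fail on existing folders.

```shell
#!/usr/bin/env bash
# warc_to_wet_commands BASE_DIR WARC_FILE   (hypothetical helper)
# Prints, in order, the commands from steps b to e for one local WARC file.
# Nothing is executed here.
warc_to_wet_commands() {
    local base="$1" warc="$2" sub
    for sub in warc wet wat; do
        echo "hdfs dfs -mkdir -p $base/$sub"
    done
    echo "hdfs dfs -put $warc $base/warc/."
    echo "\$HADOOP_MAPRED_HOME/bin/hadoop jar \$PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -strictMode -skipExisting batch-id-xyz $base/warc/*.warc.gz"
    echo "hdfs dfs -get $base/wet/*.warc.wet.gz ."
}

# Example: warc_to_wet_commands /user/vagrant/warctest crawl.warc.gz
```

The printed hadoop jar line relies on $PWD, so it needs to be run from inside the ia-hadoop-tools folder, as in step d.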
603
604
605	----------------------------------------------------
606	I. Setting up Nutch v2 on its own Vagrant VM machine
607	----------------------------------------------------
608	1. Untar vagrant-for-nutch2.tar.gz
609	2. Follow the instructions below. A copy is also in vagrant-for-nutch2/GS_README.txt
610
611	---
612	REASONING FOR THE NUTCH v2 SPECIFIC VAGRANT VM:
613	---
614	We were able to get nutch v1 working on a regular machine.
615
From a few pages online, starting with https://stackoverflow.com/questions/33354460/nutch-clone-website, it appeared that "./bin/nutch fetch -all" was the nutch command to mirror a web site. The -all flag to the ./bin/nutch fetch command was introduced in nutch v2, and nutch v2 requires HBase, which in turn presupposes hadoop.
617
The HBase version on our vagrant VM for commoncrawl was incompatible with nutch v2, but that HBase version was needed for that VM's versions of hadoop and spark. So Dr Bainbridge came up with the idea of having a separate Vagrant VM for Nutch v2, which would have the version of HBase it needed and a Hadoop version matching that. Versions compatible with nutch 2.3.1 are mentioned at https://waue0920.wordpress.com/2016/08/25/nutch-2-3-1-hbase-0-98-hadoop-2-5-solr-4-10-3/
619	(Another option was MongoDB instead of HBase, https://lobster1234.github.io/2017/08/14/search-with-nutch-mongodb-solr/, but that was not even covered in the apache nutch 2 installation guide.)
620
621	---
622	Vagrant VM for Nutch2
623	---
624	This vagrant virtual machine is based on https://github.com/martinprobson/vagrant-hadoop-hive-spark
625
626	However:
- It comes with older versions of hadoop (2.5.2) and hbase (0.98.21), and without spark, hive or other packages.
- the VM is called node2 with IP 10.211.55.102 (instead of node1 with IP 10.211.55.101)
- Since not all packages are installed, fewer ports needed forwarding. They're forwarded to portnumber+2 so as not to conflict with any vagrant VM that uses the original vagrant image's forwarded port numbers.
630	- scripts/common.sh uses HBASE_ARCHIVE of the form "hbase-${HBASE_VERSION}-hadoop2-bin.tar.gz" (the -hadoop2 suffix is specific to v0.98.21)
631	- and hbase gets installed as /usr/local/hbase-$HBASE_VERSION-hadoop2 (again with the additional -hadoop2 suffix specific to v0.98.21) in scripts/setup-hbase.sh, so the symbolic link creation there needed to refer to a path of this form.
632
633	INSTRUCTIONS:
634	a. mostly follow the "Getting Started" instructions at https://github.com/martinprobson/vagrant-hadoop-hive-spark
b. but after step 3, replace the github cloned Vagrantfile, scripts and resources folders with their modified counterparts included in the tarball that can be downloaded by visiting http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/hdfs-cc-work/vagrant-for-nutch2.tar.gz.
636	c. wherever the rest of that git page refers to "node1", IP "10.211.55.101" and specific port numbers, use instead "node2", IP "10.211.55.102" and the forwarded port numbers in the customised Vagrantfile.
If a vagrant VM named node2 or using IP "10.211.55.102" is already set up, then adjust all files in the git cloned vagrant VM folder (as already modified by the contents of this vagrant-for-nutch2 folder) as follows:
- change all occurrences of node2 and "10.211.55.102" to node3 and IP "10.211.55.103", if not already taken, and
- in the Vagrantfile, increment the forwarded ports by another 2 or so from the highest port numbers already in use by other vagrant VMs.
640	d. After doing "vagrant up --provider=virtualbox" to create the VM, do "vagrant ssh" or "vagrant ssh node<#>" (e.g. "vagrant ssh node2") for step 8 in the "Getting Started" section.
e. Inside the VM, update the package lists, then install emacs, maven and firefox:

sudo apt update

sudo apt-get install emacs
sudo apt install maven
sudo apt-get -y install firefox
649
650	f. We set up nutch 2.3.1, which can be downloaded from https://archive.apache.org/dist/nutch/2.3.1/, as that version worked as per the nutch2 tutorial instructions with the configuration of specific versions of hadoop, hbase and gora for the vagrant VM described here.
651
652	After untarring the nutch 2.3.1 source tarball,
653	1. move apache-nutch-2.3.1/conf/regex-urlfilter.txt to apache-nutch-2.3.1/conf/regex-urlfilter.txt.orig
654	2. download the two files nutch-site.xml and regex-urlfilter.GS_TEMPLATE from http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/hdfs-cc-work/conf
655	and put them into the apache-nutch-2.3.1/conf folder.
3. Then continue following the Nutch2Tutorial instructions at https://cwiki.apache.org/confluence/display/NUTCH/Nutch2Tutorial to set up nutch2 (and apache-solr, if needed, but I didn't install apache-solr for nutch v2).
657	- nutch-site.xml has already been configured to do as much optimisation and speeding up of the crawling as we know about concerning nutch
658	- for each site that will be crawled, regex-urlfilter.GS_TEMPLATE will get copied as the live "regex-urlfilter.txt" file, and lines of regex filters will be appended to its end.
659
660	------------------------------------------------------------------------
661	J. Automated crawling with Nutch v2.3.1 and post-processing
662	------------------------------------------------------------------------
663	1. When you're ready to start crawling with Nutch 2.3.1,
- copy the batchcrawl.sh file (from http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/hdfs-cc-work/scripts) into the vagrant machine at top level. Make the script executable.
- copy the to_crawl.tar.gz file containing the "to_crawl" folder (generated by running CCWETProcessor.java over the downloaded common-crawl data where MRI was the primary language) into the vagrant machine at top level.
666	- run batchcrawl.sh on a site or range of sites not yet crawled, e.g.
667	./batchcrawl.sh 00485-00500
668
669	2. When crawling is done, the above will have generated the "crawled" folder containing a subfolder for each of the crawled sites, e.g. subfolders 00485 to 00500. Each crawled site folder will contain a dump.txt with the text output of the site's web pages. The "crawled" folder with site subfolders each containing a dump.txt file can be processed with NutchTextDumpProcessor.java.
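
As an illustration of how a range argument like 00485-00500 relates to the site subfolders, here is a minimal JavaScript sketch. It is not how batchcrawl.sh is implemented (that is a shell script): the function name expandSiteRange is hypothetical, and it assumes the range is inclusive and the IDs are zero-padded to the width of the first ID, as the example above suggests.

```javascript
// Hypothetical helper: expand "00485-00500" into the list of site folder names.
// Assumes an inclusive range of zero-padded IDs; a single ID is also accepted.
function expandSiteRange(arg) {
  const [first, last = first] = arg.split('-');
  const width = first.length;
  const ids = [];
  for (let n = Number(first); n <= Number(last); n++) {
    ids.push(String(n).padStart(width, '0'));
  }
  return ids;
}

const sites = expandSiteRange('00485-00500');
console.log(sites.length, sites[0], sites[sites.length - 1]);
// prints: 16 00485 00500
```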
670
671
672	------------------------------------------------------------------------
673	K. Sending the crawled data into mongodb with NutchTextDumpProcessor.java
674	------------------------------------------------------------------------
675	1. The crawled folder should contain all the batch crawls done with nutch (section J above).
676
2. Set up the mongodb connection properties in conf/config.properties.
By default, the mongodb database name is configured to be ateacrawldata.

3. Create a mongodb database with the configured name. Unless the default db name has been changed, this means creating a database named "ateacrawldata".
681
4. Set up the environment and compile the processor (the class used in the commands below is NutchTextDumpToMongoDB):
cd maori-lang-detection/apache-opennlp-1.9.1
export OPENNLP_HOME=`pwd`
cd ../src

javac -cp ".:../conf:../lib/*:$OPENNLP_HOME/lib/opennlp-tools-1.9.1.jar" org/greenstone/atea/NutchTextDumpToMongoDB.java

5. Pass the crawled folder to the compiled class:
java -cp ".:../conf:../lib/*:$OPENNLP_HOME/lib/opennlp-tools-1.9.1.jar" org/greenstone/atea/NutchTextDumpToMongoDB /PATH/TO/crawled

6. It may take 1.5 hours or so to ingest the approximately 1450 crawled sites' data into mongodb.

7. Launch the Robo 3T MongoDB client (version 1.3 is the one we tested). Use it to connect to MongoDB's "ateacrawldata" database.
Now you can run queries.
696
697
Here are most of the important MongoDB queries I ran, along with their results where these are short.
699	# Num websites
700	db.getCollection('Websites').find({}).count()
701	1445
702
703	# Num webpages
704	db.getCollection('Webpages').find({}).count()
705	117496
706
707	# Find number of websites that have 1 or more pages detected as being in Maori (a positive numPagesInMRI)
708	db.getCollection('Websites').find({numPagesInMRI: { $gt: 0}}).count()
709	361
710
711	# Number of sites containing at least one sentence for which OpenNLP detected the best language = MRI
712	db.getCollection('Websites').find({numPagesContainingMRI: {$gt: 0}}).count()
713	868
714
# As expected, the count for the union of the above two is identical to the count for numPagesContainingMRI alone:
716	db.getCollection('Websites').find({ $or: [ { numPagesInMRI: { $gt: 0 } }, { numPagesContainingMRI: {$gt: 0} } ] } ).count()
717	868
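
Why the union adds nothing can be sketched in plain JavaScript. The sample documents below are made up, and the sketch assumes (as the counts above suggest) that any site with a page detected as MRI overall also has at least one page containing an MRI sentence.

```javascript
// Hypothetical sample documents shaped like entries of the Websites collection.
const websites = [
  { domain: 'a.example', numPagesInMRI: 3, numPagesContainingMRI: 5 },
  { domain: 'b.example', numPagesInMRI: 0, numPagesContainingMRI: 2 },
  { domain: 'c.example', numPagesInMRI: 0, numPagesContainingMRI: 0 },
];

// Plain-JavaScript stand-ins for the three find() filters above.
const inMRI = websites.filter(w => w.numPagesInMRI > 0);
const containingMRI = websites.filter(w => w.numPagesContainingMRI > 0);
const union = websites.filter(w => w.numPagesInMRI > 0 || w.numPagesContainingMRI > 0);

console.log(inMRI.length, containingMRI.length, union.length);
// prints: 1 2 2
```

Since the first set is contained in the second, the $or query returns the same sites as the numPagesContainingMRI query.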
718
719	# Find number of webpages that are deemed to be overall in MRI (pages where isMRI=true)
720	db.getCollection('Webpages').find({isMRI:true}).count()
721	7818
722
723	# Number of pages that contain any number of MRI sentences
724	db.getCollection('Webpages').find({containsMRI: true}).count()
725	20371
726
727	# Number of sites with crawled web pages that have URLs containing /mi(/) OR http(s)://mi.*
728	db.getCollection('Websites').find({urlContainsLangCodeInPath:true}).count()
729	670
730
731	# Number of websites that are outside NZ that contain /mi(/) OR http(s)://mi.*
732	# in any of its crawled webpage urls
733	db.getCollection('Websites').find({urlContainsLangCodeInPath:true, geoLocationCountryCode: {$ne : "NZ"} }).count()
734	656
735
# That leaves 670 - 656 = 14 sites in NZ with URLs containing /mi(/) OR http(s)://mi.*
14
738
739	PROJECTION QUERIES:
# For all the sites that do not originate in NZ, list their country codes (geoLocationCountryCode
# field) and the urlContainsLangCodeInPath field

db.getCollection('Websites').find({geoLocationCountryCode: {$ne:"NZ"}}, {geoLocationCountryCode:1, urlContainsLangCodeInPath: 1})
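
String comparison in these filters is case-sensitive, so the case of the country code in a $ne test matters. A minimal sketch in plain JavaScript, using made-up documents and assuming the codes are stored in upper case as in the other queries in this section:

```javascript
// Hypothetical sample documents; country codes assumed to be stored in upper case.
const websites = [
  { domain: 'a.example', geoLocationCountryCode: 'NZ' },
  { domain: 'b.example', geoLocationCountryCode: 'US' },
  { domain: 'c.example', geoLocationCountryCode: 'AU' },
];

// Plain-JavaScript stand-in for {geoLocationCountryCode: {$ne: code}}
const notFrom = code => websites.filter(w => w.geoLocationCountryCode !== code);

console.log(notFrom('NZ').length); // prints 2: the NZ site is excluded
console.log(notFrom('nz').length); // prints 3: nothing equals "nz", so nothing is excluded
```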
744
745
746	AGGREGATION QUERIES - the results of important aggregate queries here
747	can be found in the associated mongodb-data/counts*.json files.
748
749	# count of country codes for all sites
db.Websites.aggregate([
    { $unwind: "$geoLocationCountryCode" },
    {
        $group: {
            _id: "$geoLocationCountryCode",
            count: { $sum: 1 }
        }
    },
    { $sort : { count : -1} }
]);
761
762	# count of country codes for sites that have at least one page detected as MRI
763
db.Websites.aggregate([
    {
        $match: {
            numPagesInMRI: {$gt: 0}
        }
    },
    { $unwind: "$geoLocationCountryCode" },
    {
        $group: {
            _id: {$toLower: '$geoLocationCountryCode'},
            count: { $sum: 1 }
        }
    },
    { $sort : { count : -1} }
]);
779
780	# count of country codes for sites that have at least one page containing at least one sentence detected as MRI
db.Websites.aggregate([
    {
        $match: {
            numPagesContainingMRI: {$gt: 0}
        }
    },
    { $unwind: "$geoLocationCountryCode" },
    {
        $group: {
            _id: {$toLower: '$geoLocationCountryCode'},
            count: { $sum: 1 }
        }
    },
    { $sort : { count : -1} }
]);
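
What the $match, $group and $sort stages of these pipelines compute can be sketched in plain JavaScript over an in-memory array. The sample documents and the function name countByCountry are made up for illustration. The $unwind stage is left out on the assumption that each site holds a single country code string rather than an array.

```javascript
// Hypothetical sample documents shaped like entries of the Websites collection.
const websites = [
  { geoLocationCountryCode: 'NZ', numPagesContainingMRI: 4 },
  { geoLocationCountryCode: 'US', numPagesContainingMRI: 1 },
  { geoLocationCountryCode: 'NZ', numPagesContainingMRI: 2 },
  { geoLocationCountryCode: 'DE', numPagesContainingMRI: 0 },
];

// $match, then $group on the lower-cased country code with $sum: 1, then $sort.
function countByCountry(docs) {
  const counts = new Map();
  for (const d of docs.filter(w => w.numPagesContainingMRI > 0)) {
    const id = d.geoLocationCountryCode.toLowerCase();
    counts.set(id, (counts.get(id) || 0) + 1);
  }
  return [...counts]
    .map(([_id, count]) => ({ _id, count }))
    .sort((a, b) => b.count - a.count);
}

console.log(JSON.stringify(countByCountry(websites)));
// prints [{"_id":"nz","count":2},{"_id":"us","count":1}]
```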
796
797
798	# ATTEMPT TO FILTER OUT LIKELY AUTO-TRANSLATED SITES
799	# Get a count of all non-NZ (or .nz TLD) sites that don't have /mi(/) or http(s)://mi.*
800	# in the URL path of any crawled web pages of the site
db.getCollection('Websites').find(
    {$and: [
        {numPagesContainingMRI: {$gt: 0}},
        {geoLocationCountryCode: {$ne: "NZ"}},
        {domain: {$not: /\.nz$/}},
        {urlContainsLangCodeInPath: {$ne: true}}
    ]}).count()
808
809	220
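
The .nz TLD test in these queries relies on a regular expression, where the escaping of the dot and the "$" anchor both matter. The following sketch shows the difference on a few made-up domain strings:

```javascript
// Candidate domain strings, made up for illustration.
const domains = ['example.co.nz', 'franz', 'example.nz.com'];

const unescaped = /.nz$/;   // "." matches any character
const escaped = /\.nz$/;    // "\." matches a literal dot only
const unanchored = /\.nz/;  // no "$": ".nz" may appear anywhere

for (const d of domains) {
  console.log(d, unescaped.test(d), escaped.test(d), unanchored.test(d));
}
// prints:
// example.co.nz true true true
// franz true false false
// example.nz.com false false true
```

Only the escaped and anchored form matches exactly those domains that end in ".nz".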
810
811	# Aggregate: count by country codes of non-NZ related sites that
812	# don't have the language code in the URL path on any crawled pages of the site
813
db.Websites.aggregate([
    {
        $match: {
            $and: [
                {numPagesContainingMRI: {$gt: 0}},
                {geoLocationCountryCode: {$ne: "NZ"}},
                {domain: {$not: /\.nz$/}},
                {urlContainsLangCodeInPath: {$ne: true}}
            ]
        }
    },
    { $unwind: "$geoLocationCountryCode" },
    {
        $group: {
            _id: {$toLower: '$geoLocationCountryCode'},
            count: { $sum: 1 },
            domain: { $addToSet: '$domain' }
        }
    },
    { $sort : { count : -1} }
]);
835
836	The above query contains "domain: { $addToSet: '$domain' }"
837	which adds the list of matching domains for each country code
838	to the output of the aggregate result list.
839	This is useful as I'll be inspecting these manually to ensure they're not
840	auto-translated to further reduce the list if necessary.
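
The effect of collecting domains per country code can be sketched in plain JavaScript as follows. The sample documents and placeholder domains are made up. Note that a JavaScript Set keeps insertion order, whereas MongoDB does not guarantee any order for the elements produced by $addToSet.

```javascript
// Hypothetical sample documents; the domains are placeholders.
const websites = [
  { domain: 'one.example', geoLocationCountryCode: 'US' },
  { domain: 'two.example', geoLocationCountryCode: 'US' },
  { domain: 'three.example', geoLocationCountryCode: 'FR' },
];

// Group by lower-cased country code, keeping a count and a set of domains.
const groups = new Map();
for (const w of websites) {
  const id = w.geoLocationCountryCode.toLowerCase();
  const g = groups.get(id) || { _id: id, count: 0, domain: new Set() };
  g.count += 1;
  g.domain.add(w.domain);
  groups.set(id, g);
}

// Sort by descending count and turn each Set into an array for printing.
const result = [...groups.values()]
  .sort((a, b) => b.count - a.count)
  .map(g => ({ ...g, domain: [...g.domain] }));

console.log(JSON.stringify(result));
// prints [{"_id":"us","count":2,"domain":["one.example","two.example"]},{"_id":"fr","count":1,"domain":["three.example"]}]
```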
841
842	For each resulting domain, I can then inspect that website's pages in the Webpages
843	mongodb collection for whether those pages are relevant or auto-translated with a query
844	of the following form. This example works with the sample site URL https://www.lexilogos.com
845
846	db.getCollection('Webpages').find({URL:/lexilogos\.com/, mriSentenceCount: {$gt: 0}})
847
848
849	In inspecting Australian sites in the result list, I noticed that one that should not be
850	excluded from the output was https://www.kiwiproperty.com. The TLD is not .nz,
851	and the site originates in Australia, not NZ, but it's still a site of NZ content.
852	This will be an important consideration when constructing some aggregate queries further below.
853
854
855	# Count of websites that have at least 1 page containing at least one sentence detected as MRI
856	# AND which websites have mi in the URL path:
857
858	db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}]}).count()
859
860	491
861
862
# The websites that have some MRI detected AND which are either in NZ or have an .nz TLD
# or else (i.e. if they're from overseas) don't contain /mi or mi.* in the URL path:
865
866	db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{$or: [{geoLocationCountryCode: "NZ"}, {domain: /\.nz$/}, {urlContainsLangCodeInPath: false}]}]}).count()
867	396
868
869	Include Australia, to get the valid "kiwiproperty.com" website included in the result list:
870
db.getCollection('Websites').find({$and: [
    {numPagesContainingMRI: {$gt: 0}},
    {$or: [{geoLocationCountryCode: /(NZ|AU)/}, {domain: /\.nz$/}, {urlContainsLangCodeInPath: false}]}
]}).count()
875
876	397
877
878	# aggregate results by a count of country codes
db.Websites.aggregate([
    {
        $match: {
            $and: [
                {numPagesContainingMRI: {$gt: 0}},
                {$or: [{geoLocationCountryCode: /(NZ|AU)/}, {domain: /\.nz$/}, {urlContainsLangCodeInPath: false}]}
            ]
        }
    },
    { $unwind: "$geoLocationCountryCode" },
    {
        $group: {
            _id: {$toLower: '$geoLocationCountryCode'},
            count: { $sum: 1 }
        }
    },
    { $sort : { count : -1} }
]);
897
898
899	# Just considering those sites outside NZ or not with .nz TLD:
900
db.Websites.aggregate([
    {
        $match: {
            $and: [
                {geoLocationCountryCode: {$ne: "NZ"}},
                {domain: {$not: /\.nz/}},
                {numPagesContainingMRI: {$gt: 0}},
                {$or: [{geoLocationCountryCode: "AU"}, {urlContainsLangCodeInPath: false}]}
            ]
        }
    },
    { $unwind: "$geoLocationCountryCode" },
    {
        $group: {
            _id: {$toLower: '$geoLocationCountryCode'},
            count: { $sum: 1 },
            domain: { $addToSet: '$domain' }
        }
    },
    { $sort : { count : -1} }
]);
922
923
# total count of those sites, i.e. excluding NZ related sites
925
926	db.getCollection('Websites').find({$and: [
927	{geoLocationCountryCode: {$ne: "NZ"}},
928	{domain: {$not: /\.nz/}},
929	{numPagesContainingMRI: {$gt: 0}},
930	{$or: [{geoLocationCountryCode: "AU"}, {urlContainsLangCodeInPath: false}]}
931	]}).count()
932
933	221 websites
934
935
936	# But to produce the tentative non-product sites, we also want the aggregate for all NZ sites (from NZ or with .nz tld):
937	db.getCollection('Websites').find({$and: [
938	{numPagesContainingMRI: {$gt: 0}},
939	{$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]}
940	]}).count()
941
942	176
943
944	(Total is 221+176 = 397, which adds up).
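
That the two counts add up is no coincidence: the NZ filter and the overseas filter split the combined result into two disjoint parts. A minimal sketch in plain JavaScript, using made-up sample documents and hypothetical helper names (hasMRI, isNZ, keptOverseas), and assuming the .nz test is anchored at the end of the domain:

```javascript
// Hypothetical sample documents shaped like entries of the Websites collection.
const websites = [
  { domain: 'a.nz', geoLocationCountryCode: 'US', numPagesContainingMRI: 2, urlContainsLangCodeInPath: true },
  { domain: 'b.example', geoLocationCountryCode: 'NZ', numPagesContainingMRI: 1, urlContainsLangCodeInPath: false },
  { domain: 'c.example', geoLocationCountryCode: 'AU', numPagesContainingMRI: 3, urlContainsLangCodeInPath: true },
  { domain: 'd.example', geoLocationCountryCode: 'US', numPagesContainingMRI: 5, urlContainsLangCodeInPath: false },
  { domain: 'e.example', geoLocationCountryCode: 'US', numPagesContainingMRI: 4, urlContainsLangCodeInPath: true },  // overseas with /mi: dropped
  { domain: 'f.example', geoLocationCountryCode: 'NZ', numPagesContainingMRI: 0, urlContainsLangCodeInPath: false }, // no MRI: dropped
];

const hasMRI = w => w.numPagesContainingMRI > 0;
const isNZ = w => w.geoLocationCountryCode === 'NZ' || /\.nz$/.test(w.domain);
const keptOverseas = w => w.geoLocationCountryCode === 'AU' || w.urlContainsLangCodeInPath === false;

const nzSites = websites.filter(w => hasMRI(w) && isNZ(w));
const overseasSites = websites.filter(w => hasMRI(w) && !isNZ(w) && keptOverseas(w));
const combined = websites.filter(w => hasMRI(w) && (isNZ(w) || keptOverseas(w)));

console.log(nzSites.length, overseasSites.length, combined.length);
// prints: 2 2 4
```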
945
946	# Get the count (and domain listing) output put under a hardcoded _id of "nz":
db.Websites.aggregate([
    {
        $match: {
            $and: [
                {numPagesContainingMRI: {$gt: 0}},
                {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]}
            ]
        }
    },
    { $unwind: "$geoLocationCountryCode" },
    {
        $group: {
            _id: "nz",
            count: { $sum: 1 },
            domain: { $addToSet: '$domain' }
        }
    },
    { $sort : { count : -1} }
]);
966
967
968	# Manually inspected shortlist of the 221 non-NZ websites to weed out those that aren't MRI (weeding out those misdetected as MRI, autotranslated or just contain placenames etc), and adding the 176 NZ on top:
969
970	MANUAL - TOTAL NUM SITES WITH SOME MRI CONTENT BY COUNTRY:
971	NZ: 176
972	US: 25
973	AU: 3
974	DE: 2
975	DK: 2
976	BG: 1
977	CZ: 1
978	ES: 1
979	FR: 1
980	IE: 1
981	TOTAL: 213
982
983	Manually created counts.json file for above with name "6counts_nonProductSites1_manualShortlist.json"
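
As a quick arithmetic check of the totals above, here is a small JavaScript sketch. The object layout is only for illustration and is not claimed to be the format of the counts json file.

```javascript
// Counts copied from the manual shortlist above.
const counts = { NZ: 176, US: 25, AU: 3, DE: 2, DK: 2, BG: 1, CZ: 1, ES: 1, FR: 1, IE: 1 };

const total = Object.values(counts).reduce((sum, n) => sum + n, 0);
const overseas = total - counts.NZ;

console.log(total, overseas);
// prints: 213 37
```

So 37 of the 221 overseas sites survived the manual inspection.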
984
985	--------------------------------------------------------
986	APPENDIX: Legend of mongodb-data folder's contents
987	--------------------------------------------------------
988	1. allCrawledSites: all sites from CommonCrawl where the content-language=MRI, which we then crawled with Nutch with depth=10. Some obvious auto-translated websites were skipped.
989
2. sitesWithPagesInMRI: those sites of point 1 above which contained one or more pages that OpenNLP detected as having MRI as primary language

3. sitesWithPagesContainingMRI: those sites of point 1 where one or more pages contained at least one "sentence" for which the primary language detected by OpenNLP was MRI
993
994	4. tentativeNonProductSites: sites of point 3 excluding those non-NZ sites that had "mi." or "/mi" in the URL path
995
996	5. tentativeNonProductSites1: similar to point 4, but "NZ sites" in this set were not just those that were detected as originating in NZ (hosted on NZ servers?) but also any with a TLD of .nz regardless of site's country of origin.
997
998	6. nonProductSites1_manualShortlist: based on point 5, but manually inspected all the non-NZ sites for any that were not actually sources of MRI content. For example, sites where the content was in a different language misdetected by openNLP (and commoncrawl's language detection) as MRI, or any further sites that were autotranslated, sites where the "MRI" detected content were photos captioned with NZ placenames constituting the "sentence(s)" detected as being MRI.
999
1000
1001	a. All .json files that contain the "counts_" prefix are the counts by country code for each of the above variants. The comments section at the top of each such counts_.json file usually contains the mongodb query used to generate the json content of the file.
1002
1003	b. All .json files that contain "geojson-features_" and "multipoint_" prefix for each of the above variants are generated by running org/greenstone/atea/CountryCodeCountsMapData.java on the counts_.json file.
1004
1005	Run as:
1006	cd maori-lang-detection/src
java -cp ".:../conf:../lib/*" org/greenstone/atea/CountryCodeCountsMapData ../mongodb-data/[1-6]counts.json
1008
1009	This will then generate the multipoint_.json and geojson-features_.json files for any of the above 1-6 variants of the input counts json file.
1010
1011	c. All .png files that contain the "map_" prefix for each of the above variants were screenshots of the map generated by http://geojson.tools/ for each geojson-features_.json file.
1012	GIMP was used to crop each screenshot to the area of interest.
1013
1014
1015	--------------------------------------------------------
1016	APPENDIX: Reading data from hbase tables and backing up hbase
1017	--------------------------------------------------------
1018
1019	* Backing up HBase database:
1020	https://blogs.msdn.microsoft.com/data_otaku/2016/12/21/working-with-the-hbase-import-and-export-utility/
1021
1022	* From an image at http://dwgeek.com/read-hbase-table-using-hbase-shell-get-command.html/
1023	to see the contents of a table, inside hbase shell, type:
1024
1025	scan 'tablename'
1026
1027	e.g. scan '01066_webpage' and hit enter.
1028
1029
To list tables and see their "column families" (in HBase, a column family is a named group of columns that are stored together):
1031
1032	hbase shell
1033	hbase(main):001:0> list
1034
1035	hbase(main):002:0> describe '01066_webpage'
1036	Table 01066_webpage is ENABLED
1037	01066_webpage
1038	COLUMN FAMILIES DESCRIPTION
{NAME => 'f', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'h', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'il', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'mk', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'mtdt', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'ol', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'p', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 's', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
1055	8 row(s) in 0.1180 seconds
1056
1057
1058	-----------------------EOF------------------------
1059
