Changeset 33541

Timestamp:
01.10.2019 22:27:03 (2 weeks ago)
Author:
ak19
Message:

1. hdfs-cc-work/GS_README.txt now contains the complete instructions for using Autistici's crawl to download a website (as a WARC file), as well as the instructions to convert those WARCs to WET. 2. Moved the first part out of MoreReading/crawling-Nutch.txt. 3. Added patched WARC-to-WET files for the git projects ia-web-commons and ia-hadoop-tools to successfully do the WARC-to-WET processing on WARC files generated by Autistici's crawl. (Worked on Dr Bainbridge's home page site as a test. Not tried any other site yet, as I wanted to get the workflow from crawl to WET working.)

Location:
gs3-extensions/maori-lang-detection
Files:
2 added
2 modified

  • gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt

    r33540 r33541  
    44https://www.quora.com/What-are-some-Web-crawler-tips-to-avoid-crawler-traps 
    55 
    6  
    7 ALTERNATIVES TO NUTCH - looking for site mirroring capabilities 
    8 https://anarc.at/services/archive/web/  
    9     Autistici's crawl [https://git.autistici.org/ale/crawl] needs Go: 
    10     https://medium.com/better-programming/install-go-1-11-on-ubuntu-18-04-16-04-lts-8c098c503c5f     
    11     https://guide.freecodecamp.org/go/installing-go/ubuntu-apt-get/ 
    12         https://medium.com/@firebitsbr/how-to-uninstall-from-the-apt-manager-uninstall-just-golang-go-from-universe-debian-ubuntu-82d6a3692cbd 
    13     https://tecadmin.net/install-go-on-ubuntu/ [our vagrant VMs are Ubuntu 16.04 LTS, as discovered by running the cmd "lsb_release -a"] 
    14 https://alternativeto.net/software/apache-nutch/ 
    15 https://alternativeto.net/software/wget/ 
    16 https://github.com/ArchiveTeam/grab-site/blob/master/README.md#inspecting-warc-files-in-the-terminal 
    17 https://github.com/ArchiveTeam/wpull 
    18  
    19 --- 
    20 Autistici crawl: 
    21 --- 
    22 1. Install go 1.11 by following instructions at https://medium.com/better-programming/install-go-1-11-on-ubuntu-18-04-16-04-lts-8c098c503c5f 
    23 2. Create go environment: 
    24 #!/bin/bash 
    25 # environment vars for golang                
    26 export GOROOT=/usr/local/go 
    27 export GOPATH=$HOME/go 
    28 export PATH=$GOPATH/bin:$GOROOT/bin:$PATH 
    29 3. The https://git.autistici.org/ale/crawl/README.md instructions on installing are not very clear and don't work as is at this stage. 
    30  
    31 These steps work: 
    32  
    33 cd $GOPATH 
    34 mkdir bin 
    35 mkdir src 
    36 cd src 
    37  
    38 4. Since running "go install" directly on the crawl URL didn't work (see below), clone the repository manually instead: 
    39 https://stackoverflow.com/questions/14416275/error-cant-load-package-package-my-prog-found-packages-my-prog-and-main 
    40  
    41 vagrant@node2:~/go/src$ 
    42   mkdir -p git.autistici.org/ale 
    43   cd git.autistici.org/ale 
    44   git clone https://git.autistici.org/ale/crawl.git 
    45  
    46 [Now can run the install command in README.md:] 
    47   cd $GOPATH/src 
    48   go install git.autistici.org/ale/crawl/cmd/crawl 
    49  
    50 Now we should have a $GOPATH/bin folder containing the "crawl" binary 
    51  
    52 5. Run a crawl: 
    53   cd $GOPATH/bin 
    54   ./crawl https://www.cs.waikato.ac.nz/~davidb/ 
    55  
    56 which downloads the site and puts the warc file into the $GOPATH/bin folder. 
    57  
    58 More options are described in README.md, including setting the output folder and a WARC filename pattern for huge sites (so that the multiple WARC files created for one site follow the same naming pattern). 
    59  
    60 6. To view the RAW contents of a WARC file: 
    61 https://github.com/ArchiveTeam/grab-site/blob/master/README.md#viewing-the-content-in-your-warc-archives 
    62  
    63 zless <warc-file-name> 
    64  
    65 zless is already installed on the vagrant VM 
    66  
    67  
    68 Issues converting to WET: 
    69 The WARCs are not in the expected format: elements are missing from the header, and the ordering differs. 
    70 But WET is an official format, not CommonCrawl specific: 
    71  
    72 https://library.stanford.edu/projects/web-archiving/research-resources/data-formats-and-apis 
    73 WET (parsed text) 
    74  
    75 WARC Encapsulated Text (WET) or parsed text consists of the extracted plaintext, delimited by archived document. Each record retains the associated URL and timestamp. Common Crawl provides details on the format and Internet Archive provides documentation on usage, though they use different names for the format. 
    766 
    777----------- 
     
    15383 
    15484 
     85Solution to get a working nutch2: 
     86Get http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/hdfs-cc-work/vagrant-for-nutch2.tar.gz 
     87and follow the instructions in the README file in there. 
     88 
     89--------------------------------------------------------------------- 
     90ALTERNATIVES TO NUTCH - looking for site mirroring capabilities 
     91--------------------------------------------------------------------- 
     92=> https://anarc.at/services/archive/web/  
     93    Autistici's crawl [https://git.autistici.org/ale/crawl] needs Go: 
     94    https://medium.com/better-programming/install-go-1-11-on-ubuntu-18-04-16-04-lts-8c098c503c5f     
     95    https://guide.freecodecamp.org/go/installing-go/ubuntu-apt-get/ 
     96        To uninstall: https://medium.com/@firebitsbr/how-to-uninstall-from-the-apt-manager-uninstall-just-golang-go-from-universe-debian-ubuntu-82d6a3692cbd 
     97    https://tecadmin.net/install-go-on-ubuntu/ [our vagrant VMs are Ubuntu 16.04 LTS, as discovered by running the cmd "lsb_release -a"] 
     98https://alternativeto.net/software/apache-nutch/ 
     99https://alternativeto.net/software/wget/ 
     100https://github.com/ArchiveTeam/grab-site/blob/master/README.md#inspecting-warc-files-in-the-terminal 
     101https://github.com/ArchiveTeam/wpull 
     102 
  • gs3-extensions/maori-lang-detection/hdfs-cc-work/GS_README.TXT

    r33539 r33541  
    1313F.  Setup warc-to-wet tools (git projects) 
    1414G.  Getting and running our scripts 
     15--- 
 16H. Autistici's crawl - CLI to download web sites as WARCs, with basic features to avoid crawler traps 
     17 
    1518---------------------------------------- 
    1619 
     
    392395Each of these output wet folders can then be processed in turn by CCWETProcessor.java from http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java 
    393396 
     397----------------------------------- 
 398H. Autistici crawl 
     399----------------------------------- 
 400Autistici's crawl: a CLI to download web sites as WARCs, with basic features to avoid crawler traps. 
     401 
 402Of the various site-mirroring tools available, Autistici's "crawl" seemed the most promising: 
     403https://anarc.at/services/archive/web/ 
     404 
 405- CLI. 
 406- Can download a website quite simply; flags are available for additional settings. 
 407- Coded to prevent common crawler traps. 
 408- Downloads the website as a WARC file. 
 409- The WARC-to-WET process now works on the WARC file it produced for the usual test site (Dr Bainbridge's home page). 
     410 
 411Go needs to be installed in order to build and run Autistici's crawl. 
 412Not a problem, because I can do it on the remote machine (which also hosts the hdfs), where I have sudo powers. 
     413 
     414INSTRUCTIONS 
     415 
     4161. Install go 1.11 by following instructions at https://medium.com/better-programming/install-go-1-11-on-ubuntu-18-04-16-04-lts-8c098c503c5f 
     4172. Create go environment: 
     418#!/bin/bash 
     419# environment vars for golang                
     420export GOROOT=/usr/local/go 
     421export GOPATH=$HOME/go 
     422export PATH=$GOPATH/bin:$GOROOT/bin:$PATH 
     4233. The https://git.autistici.org/ale/crawl/README.md instructions on installing are not very clear and don't work as is at this stage. 
     424 
     425These steps work: 
     426 
     427cd $GOPATH 
     428mkdir bin 
     429mkdir src 
     430cd src 
     431 
 4324. Since running "go install" directly on the crawl URL didn't work (see below), clone the repository manually instead: 
     433https://stackoverflow.com/questions/14416275/error-cant-load-package-package-my-prog-found-packages-my-prog-and-main 
     434 
     435vagrant@node2:~/go/src$ 
     436  mkdir -p git.autistici.org/ale 
     437  cd git.autistici.org/ale 
     438  git clone https://git.autistici.org/ale/crawl.git 
     439 
     440[Now can run the install command in README.md:] 
     441  cd $GOPATH/src 
     442  go install git.autistici.org/ale/crawl/cmd/crawl 
     443 
     444Now we should have a $GOPATH/bin folder containing the "crawl" binary 
     445 
     4465. Run a crawl: 
     447  cd $GOPATH/bin 
     448  ./crawl https://www.cs.waikato.ac.nz/~davidb/ 
     449 
     450which downloads the site and puts the warc file into the $GOPATH/bin folder. 
     451 
 452More options are described in README.md, including setting the output folder and a WARC filename pattern for huge sites (so that the multiple WARC files created for one site follow the same naming pattern). 
     453 
     4546. To view the RAW contents of a WARC file: 
     455https://github.com/ArchiveTeam/grab-site/blob/master/README.md#viewing-the-content-in-your-warc-archives 
     456 
     457zless <warc-file-name> 
     458 
 459zless is already installed on the vagrant VM 
     460 
     461 
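Besides zless, the record headers in a (gzipped) WARC can be listed with a few lines of stdlib-only Python. This is a quick sketch, not a full WARC parser; the filename in the usage comment is just an example:

```python
import gzip

def list_record_headers(warc_path, limit=5):
    """Return the header fields of the first few records in a *.warc.gz file.

    Works on the crawl output (e.g. crawl.warc.gz) and also on *.warc.wet.gz
    files, since WET records use the same header layout.
    """
    records = []
    current = None
    with gzip.open(warc_path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            line = line.rstrip("\r\n")
            if line.startswith("WARC/"):      # start of a record header block
                current = {}
            elif current is not None:
                if line == "":                # blank line ends the header block
                    records.append(current)
                    current = None
                    if len(records) >= limit:
                        break
                elif ": " in line:
                    key, value = line.split(": ", 1)
                    current[key] = value
    return records

# Example usage (path is illustrative):
# for rec in list_record_headers("crawl.warc.gz"):
#     print(rec.get("WARC-Type"), rec.get("WARC-Target-URI", ""))
```

gzip.open reads concatenated gzip members transparently, so this also copes with WARCs where each record is a separate gzip member.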
     462----------------------------------------------------------------------------------------------- 
 463How to run warc-to-wet conversion on sites downloaded as WARCs by Autistici's "crawl"  
     464----------------------------------------------------------------------------------------------- 
     465ISSUES CONVERTING WARC to WET: 
     466--- 
 467WARC files produced by Autistici crawl are in a somewhat different format from CommonCrawl WARCs. 
     468- missing elements in header 
     469- different header elements 
     470- ordering different (if that matters) 
     471 
     472But WET is an official format, not CommonCrawl specific, as indicated by 
     473 
     474https://library.stanford.edu/projects/web-archiving/research-resources/data-formats-and-apis 
     475"WET (parsed text) 
     476 
     477WARC Encapsulated Text (WET) or parsed text consists of the extracted plaintext, delimited by archived document. Each record retains the associated URL and timestamp. Common Crawl provides details on the format and Internet Archive provides documentation on usage, though they use different names for the format." 
     478 
 479So it must be possible to get the WARC-to-WET conversion used for CommonCrawl data to work on Autistici crawl's WARC files. 
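For reference, a WET file is itself a WARC whose records have type "conversion". A single record looks roughly as follows (the field values here are illustrative, not taken from an actual file):

```
WARC/1.0
WARC-Type: conversion
WARC-Target-URI: https://www.cs.waikato.ac.nz/~davidb/
WARC-Date: 2019-10-01T09:27:03Z
WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>
WARC-Refers-To: <urn:uuid:00000000-0000-0000-0000-000000000000>
Content-Type: text/plain
Content-Length: 52

(the extracted plaintext of the archived page goes here)
```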
     480 
     481 
     482RESOLUTION: 
     483--- 
 484I made changes to 2 Java source files in the 2 github projects ia-web-commons and ia-hadoop-tools, which we use for the WARC-to-WET processing of CommonCrawl data. These git projects (with modifications for CommonCrawl) are already on http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/hdfs-cc-work/gitprojects. 
     485 
     486The changed files are as follows: 
 4871. patches/WATExtractorOutput.java 
 488   put into ia-web-commons/src/main/java/org/archive/extract 
 489   after renaming the existing file to WATExtractorOutput.orig 
     490 
     491THEN RECOMPILE ia-web-commons with: 
     492   mvn install 
     493 
 4942. patches/GZRangeClient.java 
 495   put into ia-hadoop-tools/src/main/java/org/archive/server 
 496   after renaming the existing file to GZRangeClient.orig 
     497   
     498THEN RECOMPILE ia-hadoop-tools with: 
     499   mvn package 
     500 
     501Make sure to first compile ia-web-commons, then ia-hadoop-tools. 
     502 
     503 
     504The modifications made to the above 2 files are as follows: 
     505>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
     5061. ia-web-commons/src/main/java/org/archive/extract/WATExtractorOutput.java 
     507 
     508[diff src/main/java/org/archive/extract/WATExtractorOutput.orig src/main/java/org/archive/extract/WATExtractorOutput.java] 
     509 
     510162,163c162,163 
     511<           targetURI = extractOrIO(md, "Envelope.WARC-Header-Metadata.WARC-Filename"); 
     512<       } else { 
     513--- 
     514>           targetURI = extractOrIO(md, "Envelope.WARC-Header-Metadata.WARC-Warcinfo-ID"); 
     515>       } else { 
     516 
     517 
     5182. ia-hadoop-tools/src/main/java/org/archive/server/GZRangeClient.java 
     519 
     520[diff src/main/java/org/archive/server/GZRangeClient.orig src/main/java/org/archive/server/GZRangeClient.java] 
     521 
     52276,83c76,82 
     523<       "WARC/1.0\r\n" + 
     524<       "WARC-Type: warcinfo\r\n" + 
     525<       "WARC-Date: %s\r\n" + 
     526<       "WARC-Filename: %s\r\n" + 
     527<       "WARC-Record-ID: <urn:uuid:%s>\r\n" + 
     528<       "Content-Type: application/warc-fields\r\n" + 
     529<       "Content-Length: %d\r\n\r\n"; 
     530<  
     531--- 
     532>       "WARC/1.0\r\n" + 
     533>       "Content-Type: application/warc-fields\r\n" + 
     534>       "WARC-Type: warcinfo\r\n" + 
     535>       "WARC-Warcinfo-ID: <urn:uuid:%s>\r\n" + 
     536>       "Content-Length: %d\r\n\r\n" + 
     537>       "WARC-Record-ID: <urn:uuid:%s>\r\n" +        
     538>       "WARC-Date: %s\r\n"; 
     539115,119c114,119 
     540<   private static String DEFAULT_WARC_PATTERN = "software: %s Extractor\r\n" + 
     541<   "format: WARC File Format 1.0\r\n" + 
     542<   "conformsTo: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf\r\n" + 
     543<   "publisher: Internet Archive\r\n" + 
     544<   "created: %s\r\n\r\n"; 
     545--- 
     546>   private static String DEFAULT_WARC_PATTERN = "Software: crawl/1.0\r\n" + 
     547>   "Format: WARC File Format 1.0\r\n" + 
     548>       "Conformsto: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf\r\n\r\n"; 
     549>     // + 
     550>     //"publisher: Internet Archive\r\n" + 
     551>     //"created: %s\r\n\r\n"; 
     552<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< 
     553 
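It's easy to forget one of the two copies before recompiling. The following hypothetical stdlib-only helper (the function name and the "WARC-Warcinfo-ID" marker string, which both patches introduce per the diffs above, are assumptions for illustration) checks that a patched file is in place alongside its .orig backup:

```python
import os

def patch_applied(patched_path, marker):
    """Check that `patched_path` sits alongside a .orig backup and
    contains `marker`, a string introduced by the patch."""
    orig_path = os.path.splitext(patched_path)[0] + ".orig"
    if not os.path.exists(orig_path):
        return False          # existing file was never renamed to .orig
    with open(patched_path, encoding="utf-8") as f:
        return marker in f.read()

# Usage (paths relative to where the git projects are checked out):
# patch_applied("ia-web-commons/src/main/java/org/archive/extract/WATExtractorOutput.java",
#               "WARC-Warcinfo-ID")
# patch_applied("ia-hadoop-tools/src/main/java/org/archive/server/GZRangeClient.java",
#               "WARC-Warcinfo-ID")
```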
     554 
 5553. To run the WARC-to-WET conversion, the WARC needs to live on hdfs in a "warc" folder, with "wet" and "wat" folders alongside it at the same level. 
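That is, the layout on hdfs should look like this (folder names as used in the commands below):

```
/user/vagrant/warctest/
    warc/    <- input *.warc.gz files go here
    wet/     <- the conversion writes *.warc.wet.gz files here
    wat/     <- output folder for the WAT side of the conversion
```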
     556 
     557For example, assume that running Autistici's crawl generated $GOPATH/bin/crawl.warc.gz 
     558(default location and filename unless you pass flags to crawl CLI to control these) 
     559 
     560a. Ensure you get crawl.warc.gz onto the vagrant VM with the WARC to WET git projects installed, recompiled with the above modifications. 
     561 
     562b. Now, create the folder structure needed for warc-to-wet conversion: 
     563   hdfs dfs -mkdir /user/vagrant/warctest 
     564   hdfs dfs -mkdir /user/vagrant/warctest/warc 
     565   hdfs dfs -mkdir /user/vagrant/warctest/wet 
     566   hdfs dfs -mkdir /user/vagrant/warctest/wat 
     567 
 568c. Put crawl.warc.gz into the warc folder on hdfs: 
     569   hdfs dfs -put crawl.warc.gz /user/vagrant/warctest/warc/. 
     570 
     571d. Finally, time to run the actual warc-to-wet conversion from ia-hadoop-tools: 
     572   cd ia-hadoop-tools 
     573   WARC_FOLDER=/user/vagrant/warctest/warc 
     574   $HADOOP_MAPRED_HOME/bin/hadoop jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -strictMode -skipExisting batch-id-xyz $WARC_FOLDER/crawl*.warc.gz 
     575 
 576This is more meaningful when the WARC_FOLDER contains multiple *.warc.gz files, 
 577since the above uses map-reduce to generate the *.warc.wet.gz files in the output wet folder. 
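When checking the output for many sites, it helps to know which wet file to expect for each input. Based on the crawl.warc.gz -> crawl.warc.wet.gz example in these notes (the naming for other inputs is an assumption), a tiny stdlib-only helper:

```python
def expected_wet_name(warc_name):
    """Map an input WARC filename to the wet filename the conversion is
    expected to produce (crawl.warc.gz -> crawl.warc.wet.gz; the same
    pattern for other names is an assumption)."""
    if not warc_name.endswith(".warc.gz"):
        raise ValueError("expected a *.warc.gz filename: " + warc_name)
    return warc_name[: -len(".warc.gz")] + ".warc.wet.gz"

# expected_wet_name("crawl.warc.gz") -> "crawl.warc.wet.gz"
```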
     578 
     579e. Copy the generated wet files across from /user/vagrant/warctest/wet/: 
     580 
 581   cd /vagrant   (or else cd /home/vagrant) 
     584   hdfs dfs -get /user/vagrant/warctest/wet/crawl.warc.wet.gz . 
     585 
     586or, when dealing with multiple input warc files, we'll have multiple wet files: 
 587    hdfs dfs -get /user/vagrant/warctest/wet/*.warc.wet.gz . 
     588 
     589 
 590f. Now view the contents of the WET files to confirm they are what we want: 
     591   gunzip crawl.warc.wet.gz 
     592   zless crawl.warc.wet 
     593 
     594The wet file contents should look good now: the web pages as WET records without html tags. 
     595 
    394596 
    395597-----------------------EOF------------------------