Changeset 33540

Timestamp:

2019-10-01T21:40:33+13:00 (5 years ago)

Author:

ak19

Message:

Since I wasn't getting any further with Nutch 2 in grabbing an entire site, I am committing the documentation of the work I did today on using Autistici's crawl: the steps for installing Go (golang), which Autistici's crawl needs, then how to compile crawl and the basics of running it. I got further than what I've documented here: code modifications were also necessary in the WARC-to-WET conversion used for CommonCrawl, which eventually succeeded in converting the slightly different warc.gz file produced by Autistici's crawl to WET. I will commit that separately hereafter.

File:

1 edited

gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt (modified) (1 diff)

gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt

-              r33537
+              r33540
 https://www.quora.com/What-are-some-Web-crawler-tips-to-avoid-crawler-traps
+ALTERNATIVES TO NUTCH - looking for site mirroring capabilities
+https://anarc.at/services/archive/web/
+    Autistici's crawl [https://git.autistici.org/ale/crawl] needs Go:
+    https://medium.com/better-programming/install-go-1-11-on-ubuntu-18-04-16-04-lts-8c098c503c5f
+    https://guide.freecodecamp.org/go/installing-go/ubuntu-apt-get/
+        https://medium.com/@firebitsbr/how-to-uninstall-from-the-apt-manager-uninstall-just-golang-go-from-universe-debian-ubuntu-82d6a3692cbd
+    https://tecadmin.net/install-go-on-ubuntu/ [our vagrant VMs are Ubuntu 16.04 LTS, as discovered by running the cmd "lsb_release -a"]
+https://alternativeto.net/software/apache-nutch/
+https://alternativeto.net/software/wget/
+https://github.com/ArchiveTeam/grab-site/blob/master/README.md#inspecting-warc-files-in-the-terminal
+https://github.com/ArchiveTeam/wpull
+---
+Autistici crawl:
+---
+. Install go 1.11 by following instructions at https://medium.com/better-programming/install-go-1-11-on-ubuntu-18-04-16-04-lts-8c098c503c5f
+. Create go environment:
+#!/bin/bash
+# environment vars for golang
+export GOROOT=/usr/local/go
+export GOPATH=$HOME/go
+export PATH=$GOPATH/bin:$GOROOT/bin:$PATH
+. The https://git.autistici.org/ale/crawl/README.md instructions on installing are not very clear and don't work as is at this stage.
+These steps work:
+cd $GOPATH
+mkdir bin
+mkdir src
+cd src
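+. Running "go install" directly on the crawl URL didn't work (see the link below), so clone the repository by hand into the matching path under $GOPATH/src: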
+https://stackoverflow.com/questions/14416275/error-cant-load-package-package-my-prog-found-packages-my-prog-and-main
+vagrant@node2:~/go/src$
+  mkdir -p git.autistici.org/ale
+  cd git.autistici.org/ale
+  git clone https://git.autistici.org/ale/crawl.git
+[Now can run the install command in README.md:]
+  cd $GOPATH/src
+  go install git.autistici.org/ale/crawl/cmd/crawl
+Now we should have a $GOPATH/bin folder containing the "crawl" binary
+. Run a crawl:
+  cd $GOPATH/bin
+  ./crawl https://www.cs.waikato.ac.nz/~davidb/
+which downloads the site and puts the warc file into the $GOPATH/bin folder.
+More options are covered in the instructions in README.md, including the output folder and a WARC filename pattern for huge sites, so that multiple warc files created for one site all follow the same pattern.
+. To view the RAW contents of a WARC file:
+https://github.com/ArchiveTeam/grab-site/blob/master/README.md#viewing-the-content-in-your-warc-archives
+zless <warc-file-name>
+zless is already installed on the vagrant VM
+Issues converting to WET:
+Not the warc format the converter expects: elements are missing from the header and the ordering is different.
+But WET is an official format, not CommonCrawl specific:
+https://library.stanford.edu/projects/web-archiving/research-resources/data-formats-and-apis
+WET (parsed text)
+WARC Encapsulated Text (WET) or parsed text consists of the extracted plaintext, delimited by archived document. Each record retains the associated URL and timestamp. Common Crawl provides details on the format and Internet Archive provides documentation on usage, though they use different names for the format.
 -----------
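
The notes above report that the warc.gz produced by crawl had missing header elements and a different ordering from what the WET conversion expected. A minimal sketch of one way to see this from the terminal follows. It assumes only gzip, grep, awk, sort, uniq and cut are available; the function name warc_summary is hypothetical and is not part of crawl or of the CommonCrawl tools. Running it on both a CommonCrawl WARC and a crawl-produced WARC and comparing the output should show which fields differ.

```shell
#!/bin/bash
# Hypothetical helper: summarise a gzipped WARC file so that its record
# types and header fields can be compared against another WARC file.
warc_summary() {
  local warc="$1"

  echo "Record types:"
  # grep -a treats the stream as text even when payloads are binary;
  # tr strips the carriage returns that end WARC header lines.
  gzip -dc "$warc" | grep -a '^WARC-Type:' | tr -d '\r' | sort | uniq -c

  echo "Header fields of the first record, in order:"
  # Stop at the first blank line (end of the first record's header),
  # then keep only the field name before the first colon.
  gzip -dc "$warc" | tr -d '\r' | awk 'NF == 0 { exit } { print }' | cut -d: -f1
}
```

Usage, with a placeholder filename: warc_summary <warc-file-name>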
