Timestamp:
2019-10-01T21:40:33+13:00 (4 years ago)
Author:
ak19
Message:

Since I wasn't getting any further with Nutch 2 at grabbing an entire site, I am committing the documentation of the work I did today on using Autistici's crawl: the steps for installing Go (golang), which is needed to install Autistici's crawl, then how to compile crawl and the basics of running it. I got further than what I've documented here: code modifications were also necessary in the WARC to WET conversion used for CommonCrawl, which eventually succeeded in converting the slightly different warc.gz file produced by Autistici's crawl to WET. I got that working and will commit it separately hereafter.

File:
1 edited

  • gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt

    r33537 r33540  

https://www.quora.com/What-are-some-Web-crawler-tips-to-avoid-crawler-traps


ALTERNATIVES TO NUTCH - looking for site mirroring capabilities
https://anarc.at/services/archive/web/
    Autistici's crawl [https://git.autistici.org/ale/crawl] needs Go:
    https://medium.com/better-programming/install-go-1-11-on-ubuntu-18-04-16-04-lts-8c098c503c5f
    https://guide.freecodecamp.org/go/installing-go/ubuntu-apt-get/
        https://medium.com/@firebitsbr/how-to-uninstall-from-the-apt-manager-uninstall-just-golang-go-from-universe-debian-ubuntu-82d6a3692cbd
    https://tecadmin.net/install-go-on-ubuntu/ [our vagrant VMs are Ubuntu 16.04 LTS, as discovered by running the cmd "lsb_release -a"]
https://alternativeto.net/software/apache-nutch/
https://alternativeto.net/software/wget/
https://github.com/ArchiveTeam/grab-site/blob/master/README.md#inspecting-warc-files-in-the-terminal
https://github.com/ArchiveTeam/wpull

---
Autistici crawl:
---
1. Install Go 1.11 by following the instructions at https://medium.com/better-programming/install-go-1-11-on-ubuntu-18-04-16-04-lts-8c098c503c5f
2. Create the Go environment:
#!/bin/bash
# environment vars for golang
export GOROOT=/usr/local/go
export GOPATH=$HOME/go
export PATH=$GOPATH/bin:$GOROOT/bin:$PATH
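
Sourcing the script in step 2 more than once adds the same two directories to PATH again each time. A minimal sketch of an idempotent variant, assuming a POSIX shell; path_prepend is a hypothetical helper name, not part of Go or of the original notes:

```shell
#!/bin/sh
# Hypothetical helper: prepend a directory to PATH only if it is not
# already listed, so re-sourcing this file does not keep growing PATH.
path_prepend() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present: leave PATH unchanged
        *) PATH="$1:$PATH" ;;
    esac
}

export GOROOT=/usr/local/go          # install location used in the notes
export GOPATH="$HOME/go"             # workspace location used in the notes
path_prepend "$GOROOT/bin"
path_prepend "$GOPATH/bin"           # ends up ahead of $GOROOT/bin when both are new
export PATH
```

After sourcing it, "go version" should report the installed Go release if step 1 succeeded.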
3. The installation instructions in https://git.autistici.org/ale/crawl/README.md are not very clear and don't work as is at this stage.

These steps work:

cd $GOPATH
mkdir bin
mkdir src
cd src

4. Running "go install" directly on the crawl URL didn't work (see the link below), so clone the repository into the GOPATH source tree by hand:
https://stackoverflow.com/questions/14416275/error-cant-load-package-package-my-prog-found-packages-my-prog-and-main

vagrant@node2:~/go/src$
  mkdir -p git.autistici.org/ale
  cd git.autistici.org/ale
  git clone https://git.autistici.org/ale/crawl.git

[Now the install command in README.md can be run:]
  cd $GOPATH/src
  go install git.autistici.org/ale/crawl/cmd/crawl

Now we should have a $GOPATH/bin folder containing the "crawl" binary.
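
Steps 3 and 4 rely on the GOPATH convention that a package's source lives at $GOPATH/src/<import path>, which is why the clone has to happen inside git.autistici.org/ale. A minimal sketch of that mapping, assuming a POSIX shell; clone_parent_dir and prepare_workspace are hypothetical helper names introduced only for illustration:

```shell
#!/bin/sh
# Hypothetical helper: print the directory in which "git clone" must be run
# so that the checkout ends up at <gopath>/src/<import path>.
clone_parent_dir() {
    gopath=$1
    import_path=$2
    printf '%s\n' "$gopath/src/$(dirname "$import_path")"
}

# Hypothetical helper: create bin/ and the clone parent directory under the
# given workspace. mkdir -p makes this safe to re-run.
prepare_workspace() {
    mkdir -p "$1/bin" "$(clone_parent_dir "$1" "$2")"
}

# Usage, mirroring the manual steps above (not executed here):
#   prepare_workspace "$GOPATH" git.autistici.org/ale/crawl
#   cd "$(clone_parent_dir "$GOPATH" git.autistici.org/ale/crawl)"
#   git clone https://git.autistici.org/ale/crawl.git
```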

5. Run a crawl:
  cd $GOPATH/bin
  ./crawl https://www.cs.waikato.ac.nz/~davidb/

This downloads the site and puts the WARC file into the $GOPATH/bin folder.

More options, including the output folder and a WARC filename pattern (so that the multiple WARC files created for a huge site follow the same naming pattern), are described in README.md.

6. To view the RAW contents of a WARC file:
https://github.com/ArchiveTeam/grab-site/blob/master/README.md#viewing-the-content-in-your-warc-archives

zless <warc-file-name>

zless is already installed on the vagrant VM.
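
Paging through a whole archive with zless is slow when the question is only which URLs were captured. A minimal sketch that prints the WARC-Target-URI header of each record, assuming gzip, tr, grep (with the -a option, as in GNU grep) and cut are available; list_warc_uris is a hypothetical helper name. It matches lines by prefix only, so a payload line that happens to start with the same text would also be printed:

```shell
#!/bin/sh
# Hypothetical helper: print the target URI of every record in a
# gzip-compressed WARC file. WARC header lines end in CRLF, so the
# carriage returns are stripped before matching.
list_warc_uris() {
    gzip -dc "$1" | tr -d '\r' | grep -a '^WARC-Target-URI: ' | cut -d' ' -f2-
}
```

For example: list_warc_uris <warc-file-name> | less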


Issues converting to WET:
The WARC file produced by crawl is not in the form the CommonCrawl WARC to WET conversion expects: some header elements are missing and the ordering differs.
But WET is an official format, not specific to CommonCrawl:

https://library.stanford.edu/projects/web-archiving/research-resources/data-formats-and-apis
WET (parsed text)

WARC Encapsulated Text (WET) or parsed text consists of the extracted plaintext, delimited by archived document. Each record retains the associated URL and timestamp. Common Crawl provides details on the format and Internet Archive provides documentation on usage, though they use different names for the format.
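
To see which header elements are missing or ordered differently, it can help to list the header field names of the first record in each file and compare them. A minimal sketch, assuming gzip, tr and awk are available; first_record_fields is a hypothetical helper name, and the two file names in the usage comment are placeholders:

```shell
#!/bin/sh
# Hypothetical helper: print, in order, the header field names of the first
# record in a gzip-compressed WARC file. The first line (the WARC version
# line) is skipped, and output stops at the blank line ending the header.
first_record_fields() {
    gzip -dc "$1" | tr -d '\r' |
        awk 'NR == 1 { next } /^$/ { exit } { sub(/:.*/, ""); print }'
}

# Usage (placeholder file names, not executed here):
#   first_record_fields crawl-output.warc.gz > crawl-fields.txt
#   first_record_fields commoncrawl-sample.warc.gz > cc-fields.txt
#   diff crawl-fields.txt cc-fields.txt
```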

-----------