Changeset 33540

Timestamp:
01.10.2019 21:40:33 (2 weeks ago)
Author:
ak19
Message:

Since I wasn't getting any further with Nutch 2 in grabbing an entire site, I am committing the documentation of the work I did today on using Autistici's crawl: the steps for installing Go (golang), which is needed to install Autistici's crawl, then how to compile crawl and the basics of running it. I got further than what I've documented: code modifications were also necessary in the WARC-to-WET conversion used for CommonCrawl, which eventually succeeded in converting the slightly different warc.gz file produced by Autistici's crawl to WET. I will commit that separately hereafter.

Files:
1 modified

  • gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt

    r33537 r33540  

https://www.quora.com/What-are-some-Web-crawler-tips-to-avoid-crawler-traps


ALTERNATIVES TO NUTCH - looking for site mirroring capabilities
https://anarc.at/services/archive/web/
    Autistici's crawl [https://git.autistici.org/ale/crawl] needs Go:
    https://medium.com/better-programming/install-go-1-11-on-ubuntu-18-04-16-04-lts-8c098c503c5f
    https://guide.freecodecamp.org/go/installing-go/ubuntu-apt-get/
        https://medium.com/@firebitsbr/how-to-uninstall-from-the-apt-manager-uninstall-just-golang-go-from-universe-debian-ubuntu-82d6a3692cbd
    https://tecadmin.net/install-go-on-ubuntu/ [our vagrant VMs are Ubuntu 16.04 LTS, as discovered by running the cmd "lsb_release -a"]
https://alternativeto.net/software/apache-nutch/
https://alternativeto.net/software/wget/
https://github.com/ArchiveTeam/grab-site/blob/master/README.md#inspecting-warc-files-in-the-terminal
https://github.com/ArchiveTeam/wpull

---
Autistici crawl:
---
1. Install Go 1.11 by following the instructions at https://medium.com/better-programming/install-go-1-11-on-ubuntu-18-04-16-04-lts-8c098c503c5f
2. Create the Go environment:
#!/bin/bash
# environment vars for golang
export GOROOT=/usr/local/go
export GOPATH=$HOME/go
export PATH=$GOPATH/bin:$GOROOT/bin:$PATH
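
A quick way to confirm that the environment from step 2 took effect is to check that both bin directories ended up on PATH. The following is a minimal sketch and not part of the original notes: the helper name path_has is hypothetical, and the exports simply repeat the values given above.

```shell
#!/bin/bash
# Sketch: verify the Go environment from step 2.
# path_has is a hypothetical helper, not something shipped with Go.

# Return 0 if directory $1 is one of the colon-separated entries of PATH.
path_has() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

export GOROOT=/usr/local/go
export GOPATH=$HOME/go
export PATH=$GOPATH/bin:$GOROOT/bin:$PATH

for dir in "$GOPATH/bin" "$GOROOT/bin"; do
    if path_has "$dir"; then
        echo "on PATH: $dir"
    else
        echo "MISSING from PATH: $dir"
    fi
done

# Report the installed Go version, if the go binary can be found at all.
if command -v go >/dev/null 2>&1; then
    go version
else
    echo "go binary not found on PATH"
fi
```

If the go binary is reported as not found, the most likely cause is that GOROOT does not point at the directory where Go was actually unpacked.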
3. The installation instructions in https://git.autistici.org/ale/crawl/README.md are not very clear and don't work as-is at this stage.

These steps work:

cd $GOPATH
mkdir bin
mkdir src
cd src

4. Running "go install" directly on the crawl URL didn't work (see
https://stackoverflow.com/questions/14416275/error-cant-load-package-package-my-prog-found-packages-my-prog-and-main),
so clone the repository manually into the matching directory under $GOPATH/src:

vagrant@node2:~/go/src$
  mkdir -p git.autistici.org/ale
  cd git.autistici.org/ale
  git clone https://git.autistici.org/ale/crawl.git

[Now the install command from README.md can be run:]
  cd $GOPATH/src
  go install git.autistici.org/ale/crawl/cmd/crawl

The $GOPATH/bin folder should now contain the "crawl" binary.
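
Steps 3 and 4 could be gathered into one script. The sketch below is a convenience written for these notes, not something taken from README.md: the names IMPORT_PATH, REPO_URL, DRY_RUN and run are made up for illustration. With DRY_RUN=1 (the default in this sketch) the commands are only printed, so nothing is cloned or built until it is run with DRY_RUN=0.

```shell
#!/bin/bash
# Sketch: steps 3 and 4 as a single script (GOPATH-style layout, as in the notes).
# IMPORT_PATH, REPO_URL, DRY_RUN and run() are illustrative names.

GOPATH=${GOPATH:-$HOME/go}
IMPORT_PATH=git.autistici.org/ale/crawl
REPO_URL=https://git.autistici.org/ale/crawl.git
DRY_RUN=${DRY_RUN:-1}

# Print the command; execute it only when DRY_RUN is 0.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# The clone has to land in $GOPATH/src/<import path> so that
# "go install" can resolve the package by its import path.
SRC_DIR=$GOPATH/src/$IMPORT_PATH

run mkdir -p "$GOPATH/bin" "$(dirname "$SRC_DIR")"
run git clone "$REPO_URL" "$SRC_DIR"
run cd "$GOPATH/src"
run go install "$IMPORT_PATH/cmd/crawl"
```

Using mkdir -p here means the script can be re-run without failing on directories that already exist; the git clone step will still refuse to overwrite an existing checkout.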

5. Run a crawl:
  cd $GOPATH/bin
  ./crawl https://www.cs.waikato.ac.nz/~davidb/

This downloads the site and puts the WARC file into the $GOPATH/bin folder.

More options are described in README.md, including the output folder and a WARC filename pattern for huge sites, so that the multiple WARC files created for one site follow the same pattern.

6. To view the RAW contents of a WARC file:
https://github.com/ArchiveTeam/grab-site/blob/master/README.md#viewing-the-content-in-your-warc-archives

zless <warc-file-name>

zless is already installed on the vagrant VM.
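
Paging through a whole archive with zless gets unwieldy for larger sites. As a rough sketch (not from the grab-site README), the record headers can be filtered instead. WARC-Target-URI is a standard WARC header field; the function name warc_uris and the file name in the usage comment are placeholders.

```shell
#!/bin/bash
# Sketch: list the distinct URLs captured in a gzipped WARC file.
# warc_uris is a hypothetical helper name.

warc_uris() {
    # gzip -dc decompresses to stdout; grep -a treats binary payloads
    # as text so that header lines can still be matched.
    gzip -dc "$1" \
        | grep -a '^WARC-Target-URI:' \
        | tr -d '\r' \
        | sed 's/^WARC-Target-URI:[[:space:]]*//' \
        | sort -u
}

# Usage (placeholder file name):
#   warc_uris my-crawl.warc.gz
```

Because request and response records usually share the same target URI, sort -u is used to collapse them into one line per URL.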


Issues converting to WET:
The WARC file produced by crawl is not in the format the converter expects: elements are missing from the header and the ordering is different.
But WET is an official format, not specific to CommonCrawl:

https://library.stanford.edu/projects/web-archiving/research-resources/data-formats-and-apis
WET (parsed text)

WARC Encapsulated Text (WET) or parsed text consists of the extracted plaintext, delimited by archived document. Each record retains the associated URL and timestamp. Common Crawl provides details on the format and Internet Archive provides documentation on usage, though they use different names for the format.
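
To see concretely how one WARC file's header differs from another's (missing elements, different ordering), it can help to print the header of the first record of each and compare them. This is a minimal sketch assuming a gzipped WARC; the function name warc_first_header and the file names in the usage comment are made up for illustration, and this is not the conversion code mentioned in the commit message.

```shell
#!/bin/bash
# Sketch: print the header lines of the first record in a gzipped WARC,
# in file order, so that two WARC files can be compared.
# warc_first_header is a hypothetical helper name.

warc_first_header() {
    # Strip carriage returns, then print lines up to (but not including)
    # the first blank line, which ends the first record's header.
    gzip -dc "$1" | tr -d '\r' | awk 'NF == 0 { exit } { print }'
}

# Usage (placeholder file names):
#   diff <(warc_first_header crawl-output.warc.gz) <(warc_first_header commoncrawl-sample.warc.gz)
```

Only the first record is shown because that is usually enough to spot header differences; later records can be inspected with zless as in step 6.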

-----------