Changeset 33540

01.10.2019 21:40:33

Since I wasn't getting further with Nutch 2 to grab an entire site, I am committing the documentation of the work I did today on using Autistici's crawl: steps on installing Go (golang) in order to install Autistici's crawl, then how to compile it and the basics of running it. I got further than what I've documented: code modifications were also necessary in the WARC-to-WET conversion used for CommonCrawl before it would successfully convert the slightly different warc.gz file produced by Autistici's crawl to WET, but I got that working and will commit it separately hereafter.

1 modified


  • gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt

    r33537 r33540  
ALTERNATIVES TO NUTCH - looking for site mirroring capabilities
Autistici's crawl [] needs Go:
[our vagrant VMs are Ubuntu 16.04 LTS, as discovered by running the cmd "lsb_release -a"]
Autistici crawl:
1. Install go 1.11 by following instructions at
2. Create go environment:
# environment vars for golang
export GOROOT=/usr/local/go
export GOPATH=$HOME/go
export PATH=$GOPATH/bin:$GOROOT/bin:$PATH
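Those exports only last for the current shell session. A minimal sketch for making them permanent, assuming bash and the standard ~/.bashrc (adjust GOROOT if the go tarball was unpacked elsewhere):

```shell
# Persist the Go environment so new shells pick it up automatically
cat >> ~/.bashrc <<'EOF'
# environment vars for golang
export GOROOT=/usr/local/go
export GOPATH=$HOME/go
export PATH=$GOPATH/bin:$GOROOT/bin:$PATH
EOF

# Load it into the current shell and confirm the variables resolved
source ~/.bashrc
echo "$GOPATH"
```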
3. The instructions on installing are not very clear and don't work as-is at this stage.
These steps work:
  cd $GOPATH
  mkdir bin
  mkdir src
  cd src
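To sanity-check the layout those commands create, the same steps can be replayed under a scratch GOPATH (/tmp/go-demo is a throwaway path used here so a real $GOPATH is untouched):

```shell
# Recreate the expected $GOPATH layout under a throwaway directory
rm -rf /tmp/go-demo
export GOPATH=/tmp/go-demo
mkdir -p $GOPATH
cd $GOPATH
mkdir bin
mkdir src
ls $GOPATH
```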
4. Since trying to "go install" the crawl url didn't work:
  mkdir -p
  cd
  git clone
[Now we can run the install command in]
  cd $GOPATH/src
  go install
Now we should have a $GOPATH/bin folder containing the "crawl" binary
5. Run a crawl:
  cd $GOPATH/bin
  ./crawl
which downloads the site and puts the WARC file into the $GOPATH/bin folder.
More options, including the output folder and a WARC filename pattern for huge sites (so that the multiple WARC files created for one site follow the same pattern), are all in the instructions in
6. To view the RAW contents of a WARC file:
  zless <warc-file-name>
zless is already installed on the vagrant VM
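Since crawl's output is just gzip-compressed text records, any gzip-aware pager or cat will do. A quick sketch with a stand-in file (sample.warc.gz is a placeholder; substitute the file crawl actually produced):

```shell
# Make a tiny stand-in .warc.gz (a real one comes from the crawl run above)
printf 'WARC/1.0\r\nWARC-Type: warcinfo\r\n' | gzip > sample.warc.gz

# Page through the raw records interactively:
#   zless sample.warc.gz
# Or dump them to stdout non-interactively:
zcat sample.warc.gz

rm sample.warc.gz
```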
Issues converting to WET:
Not the correct WARC format: elements are missing from the header, and the ordering is different.
But WET is an official format, not CommonCrawl-specific:
WET (parsed text)
WARC Encapsulated Text (WET), or parsed text, consists of the extracted plaintext, delimited by archived document. Each record retains the associated URL and timestamp. Common Crawl provides details on the format, and the Internet Archive provides documentation on its usage, though they use different names for the format.
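To pin down which header elements are missing or reordered, the WARC-* fields of a record can be listed and compared against a CommonCrawl-produced file. A sketch using a stand-in record (the field values below are illustrative; inspect the real crawl output the same way):

```shell
# Build a stand-in WARC record to demonstrate listing its header fields
printf 'WARC/1.0\r\nWARC-Type: response\r\nWARC-Date: 2019-10-01T21:40:33Z\r\nWARC-Record-ID: <urn:uuid:0000>\r\n\r\n' \
  | gzip > sample.warc.gz

# List the distinct WARC header fields present; diffing two such lists
# (crawl's output vs a CommonCrawl file) shows what the converter misses
zcat sample.warc.gz | grep -a '^WARC-' | cut -d: -f1 | sort -u

rm sample.warc.gz
```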