Timestamp: 2019-10-01T21:40:33+13:00
gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt
r33537 r33540

https://www.quora.com/What-are-some-Web-crawler-tips-to-avoid-crawler-traps


ALTERNATIVES TO NUTCH - looking for site mirroring capabilities
https://anarc.at/services/archive/web/
Autistici's crawl [https://git.autistici.org/ale/crawl] needs Go:
https://medium.com/better-programming/install-go-1-11-on-ubuntu-18-04-16-04-lts-8c098c503c5f
https://guide.freecodecamp.org/go/installing-go/ubuntu-apt-get/
https://medium.com/@firebitsbr/how-to-uninstall-from-the-apt-manager-uninstall-just-golang-go-from-universe-debian-ubuntu-82d6a3692cbd
https://tecadmin.net/install-go-on-ubuntu/ [our vagrant VMs are Ubuntu 16.04 LTS, as discovered by running the cmd "lsb_release -a"]
https://alternativeto.net/software/apache-nutch/
https://alternativeto.net/software/wget/
https://github.com/ArchiveTeam/grab-site/blob/master/README.md#inspecting-warc-files-in-the-terminal
https://github.com/ArchiveTeam/wpull

---
Autistici crawl:
---
1. Install Go 1.11 by following the instructions at https://medium.com/better-programming/install-go-1-11-on-ubuntu-18-04-16-04-lts-8c098c503c5f

2. Create the Go environment:
    #!/bin/bash
    # environment vars for golang
    export GOROOT=/usr/local/go
    export GOPATH=$HOME/go
    export PATH=$GOPATH/bin:$GOROOT/bin:$PATH

3. The installation instructions at https://git.autistici.org/ale/crawl/README.md are not very clear and don't work as is at this stage.

   These steps work:

    cd $GOPATH
    mkdir bin
    mkdir src
    cd src

4. Running go install directly on the crawl URL didn't work (see
   https://stackoverflow.com/questions/14416275/error-cant-load-package-package-my-prog-found-packages-my-prog-and-main),
   so clone the repository into the matching directory under $GOPATH/src instead.

   From vagrant@node2:~/go/src$ run:
    mkdir -p git.autistici.org/ale
    cd git.autistici.org/ale
    git clone https://git.autistici.org/ale/crawl.git

   [Now the install command in README.md can be run:]
    cd $GOPATH/src
    go install git.autistici.org/ale/crawl/cmd/crawl

   There should now be a $GOPATH/bin folder containing the "crawl" binary.

5. Run a crawl:
    cd $GOPATH/bin
    ./crawl https://www.cs.waikato.ac.nz/~davidb/

   This downloads the site and puts the WARC file into the $GOPATH/bin folder.

   README.md describes more options, including the output folder and a WARC filename pattern for huge sites, so that multiple WARC files created for one site follow the same naming pattern.

6. To view the RAW contents of a WARC file:
   https://github.com/ArchiveTeam/grab-site/blob/master/README.md#viewing-the-content-in-your-warc-archives

    zless <warc-file-name>

   zless is already installed on the vagrant VMs.


Issues converting to WET:
The crawl output is not in the correct WARC format: elements are missing in the header and the ordering is different.
But WET is an official format, not CommonCrawl specific:

https://library.stanford.edu/projects/web-archiving/research-resources/data-formats-and-apis
WET (parsed text)

WARC Encapsulated Text (WET) or parsed text consists of the extracted plaintext, delimited by archived document. Each record retains the associated URL and timestamp. Common Crawl provides details on the format and Internet Archive provides documentation on usage, though they use different names for the format.

-----------
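Since the WET conversion issues above come down to what is in the WARC record headers, a quicker check than paging through the whole file with zless may help. The following is only a rough sketch: warc_record_types and warc_first_header are hypothetical helper names (not part of crawl or grab-site), and the sketch assumes the WARC file is either gzip-compressed or plain text, as zless does.

```shell
#!/bin/bash
# Hypothetical helpers for a quick look at WARC headers.
# Assumption: the file is a gzip-compressed or uncompressed WARC file.

# Count how many records of each WARC-Type the file contains.
# "gzip -cdf" decompresses gzip input and copies non-gzip input unchanged.
# "grep -a" treats the (partly binary) archive as text.
# This is a rough summary: a payload line that happens to start with
# "WARC-Type:" would be counted too.
warc_record_types() {
    gzip -cdf "$1" | grep -a '^WARC-Type:' | tr -d '\r' | sort | uniq -c
}

# Print the header of the first record, i.e. everything up to and
# including the first blank line, with carriage returns stripped.
warc_first_header() {
    gzip -cdf "$1" | tr -d '\r' | sed '/^$/q'
}
```

Usage, after sourcing the functions in the same shell: "warc_record_types <warc-file-name>" lists the record types with their counts, and "warc_first_header <warc-file-name>" shows which header fields the first record has and in what order, which can then be compared against a WARC file that is known to convert.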