Timestamp: 2019-10-01T22:27:03+13:00
Files: 1 edited
--- gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt (r33540)
+++ gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt (r33541)
@@ -4,74 +4,4 @@
 https://www.quora.com/What-are-some-Web-crawler-tips-to-avoid-crawler-traps
 
-
-ALTERNATIVES TO NUTCH - looking for site mirroring capabilities
-https://anarc.at/services/archive/web/
-Autistici's crawl [https://git.autistici.org/ale/crawl] needs Go:
-https://medium.com/better-programming/install-go-1-11-on-ubuntu-18-04-16-04-lts-8c098c503c5f
-https://guide.freecodecamp.org/go/installing-go/ubuntu-apt-get/
-https://medium.com/@firebitsbr/how-to-uninstall-from-the-apt-manager-uninstall-just-golang-go-from-universe-debian-ubuntu-82d6a3692cbd
-https://tecadmin.net/install-go-on-ubuntu/ [our vagrant VMs are Ubuntu 16.04 LTS, as discovered by running the cmd "lsb_release -a"]
-https://alternativeto.net/software/apache-nutch/
-https://alternativeto.net/software/wget/
-https://github.com/ArchiveTeam/grab-site/blob/master/README.md#inspecting-warc-files-in-the-terminal
-https://github.com/ArchiveTeam/wpull
-
----
-Autistici crawl:
----
-1. Install go 1.11 by following instructions at https://medium.com/better-programming/install-go-1-11-on-ubuntu-18-04-16-04-lts-8c098c503c5f
-2. Create go environment:
-#!/bin/bash
-# environment vars for golang
-export GOROOT=/usr/local/go
-export GOPATH=$HOME/go
-export PATH=$GOPATH/bin:$GOROOT/bin:$PATH
-3. The https://git.autistici.org/ale/crawl/README.md instructions on installing are not very clear and don't work as is at this stage.
-
-These steps work:
-
-cd $GOPATH
-mkdir bin
-mkdir src
-cd src
-
-4. Since trying to go install the crawl url didn't work
-https://stackoverflow.com/questions/14416275/error-cant-load-package-package-my-prog-found-packages-my-prog-and-main
-
-vagrant@node2:~/go/src$
-mkdir -p git.autistici.org/ale
-cd git.autistici.org/ale
-git clone https://git.autistici.org/ale/crawl.git
-
-[Now can run the install command in README.md:]
-cd $GOPATH/src
-go install git.autistici.org/ale/crawl/cmd/crawl
-
-Now we should have a $GOPATH/bin folder containing the "crawl" binary
-
-5. Run a crawl:
-cd $GOPATH/bin
-./crawl https://www.cs.waikato.ac.nz/~davidb/
-
-which downloads the site and puts the warc file into the $GOPATH/bin folder.
-
-More options, including output folder, WARC filename pattern for huge sites so that multiple warc files created for one site follow the same pattern are all in the instructions in README.md
-
-6. To view the RAW contents of a WARC file:
-https://github.com/ArchiveTeam/grab-site/blob/master/README.md#viewing-the-content-in-your-warc-archives
-
-zless <warc-file-name>
-
-zless already installed on vagrant file
-
-
-Issues converting to Wet:
-Not the correct warc format: missing elements in header, ordering different.
-But WET is an official format, not CommonCrawl specific:
-
-https://library.stanford.edu/projects/web-archiving/research-resources/data-formats-and-apis
-WET (parsed text)
-
-WARC Encapsulated Text (WET) or parsed text consists of the extracted plaintext, delimited by archived document. Each record retains the associated URL and timestamp. Common Crawl provides details on the format and Internet Archive provides documentation on usage, though they use different names for the format.
 
 -----------
@@ -153,2 +83,20 @@
 
 
+Solution to get a working nutch2:
+get http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/hdfs-cc-work/vagrant-for-nutch2.tar.gz
+And follow the instructions in my README file in there.
+
+---------------------------------------------------------------------
+ALTERNATIVES TO NUTCH - looking for site mirroring capabilities
+---------------------------------------------------------------------
+=> https://anarc.at/services/archive/web/
+Autistici's crawl [https://git.autistici.org/ale/crawl] needs Go:
+https://medium.com/better-programming/install-go-1-11-on-ubuntu-18-04-16-04-lts-8c098c503c5f
+https://guide.freecodecamp.org/go/installing-go/ubuntu-apt-get/
+To uninstall: https://medium.com/@firebitsbr/how-to-uninstall-from-the-apt-manager-uninstall-just-golang-go-from-universe-debian-ubuntu-82d6a3692cbd
+https://tecadmin.net/install-go-on-ubuntu/ [our vagrant VMs are Ubuntu 16.04 LTS, as discovered by running the cmd "lsb_release -a"]
+https://alternativeto.net/software/apache-nutch/
+https://alternativeto.net/software/wget/
+https://github.com/ArchiveTeam/grab-site/blob/master/README.md#inspecting-warc-files-in-the-terminal
+https://github.com/ArchiveTeam/wpull
+
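
The Autistici crawl steps recorded in the r33540 text (Go environment, workspace layout, clone, install, run) could be collected into one script along the following lines. This is only a sketch, assuming Go 1.11 is already installed under /usr/local/go as in step 1. The function names are hypothetical, and the GOPATH fallback (honour an existing GOPATH, else $HOME/go) is an assumption added here so the script can be tried in a scratch directory; the notes themselves hard-code $HOME/go.

```shell
#!/bin/bash
# Sketch of steps 2-5 of the "Autistici crawl" notes, wrapped in functions.
# Function names are illustrative only; they do not come from the crawl README.

# Step 2: environment vars for golang.
# Assumption: keep a pre-set GOPATH, otherwise use $HOME/go as in the notes.
setup_go_env() {
    export GOROOT=/usr/local/go
    export GOPATH="${GOPATH:-$HOME/go}"
    export PATH="$GOPATH/bin:$GOROOT/bin:$PATH"
}

# Steps 3-4 (layout): create bin/ and src/, plus the directory that will
# hold the crawl checkout under src/.
make_workspace() {
    mkdir -p "$GOPATH/bin" "$GOPATH/src/git.autistici.org/ale"
}

# Step 4 (install): clone crawl into the GOPATH tree, then run the install
# command from the README. Needs network access, git and go.
install_crawl() {
    (
        cd "$GOPATH/src/git.autistici.org/ale" || exit 1
        [ -d crawl ] || git clone https://git.autistici.org/ale/crawl.git || exit 1
        cd "$GOPATH/src" || exit 1
        go install git.autistici.org/ale/crawl/cmd/crawl
    )
}

# Step 5: crawl the URL given as $1 from $GOPATH/bin, which is where the
# notes report the WARC file being written.
run_crawl() {
    (
        cd "$GOPATH/bin" || exit 1
        ./crawl "$1"
    )
}

# Example invocation (commented out because it needs network access):
# setup_go_env && make_workspace && install_crawl \
#     && run_crawl "https://www.cs.waikato.ac.nz/~davidb/"
```

The clone and install steps run in subshells so that the caller's working directory is left unchanged. The resulting WARC file can then be inspected with zless as described in step 6.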