Timestamp:
2019-10-01T22:27:03+13:00
Author:
ak19
Message:
  1. hdfs-cc-work/GS_README.txt now contains the complete instructions for using Autistici crawl to download a website (as a WARC file), as well as the instructions for converting those WARCs to WET. 2. Moved the first part out of MoreReading/crawling-Nutch.txt. 3. Added patched WARC-to-WET files for the git projects ia-web-commons and ia-hadoop-tools so that WARC-to-WET processing succeeds on WARC files generated by Autistici crawl. (Worked on Dr Bainbridge's home page site as a test. Not tried on any other site yet, as I wanted to get the workflow from crawl to WET working first.)
File:
1 edited

  • gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt

r33540 → r33541
https://www.quora.com/What-are-some-Web-crawler-tips-to-avoid-crawler-traps

[Removed in r33541 (this material was moved to hdfs-cc-work/GS_README.txt):]

ALTERNATIVES TO NUTCH - looking for site mirroring capabilities
https://anarc.at/services/archive/web/
    Autistici's crawl [https://git.autistici.org/ale/crawl] needs Go:
    https://medium.com/better-programming/install-go-1-11-on-ubuntu-18-04-16-04-lts-8c098c503c5f
    https://guide.freecodecamp.org/go/installing-go/ubuntu-apt-get/
        https://medium.com/@firebitsbr/how-to-uninstall-from-the-apt-manager-uninstall-just-golang-go-from-universe-debian-ubuntu-82d6a3692cbd
    https://tecadmin.net/install-go-on-ubuntu/ [our vagrant VMs are Ubuntu 16.04 LTS, as discovered by running the cmd "lsb_release -a"]
https://alternativeto.net/software/apache-nutch/
https://alternativeto.net/software/wget/
https://github.com/ArchiveTeam/grab-site/blob/master/README.md#inspecting-warc-files-in-the-terminal
https://github.com/ArchiveTeam/wpull

---
Autistici crawl:
---
1. Install go 1.11 by following instructions at https://medium.com/better-programming/install-go-1-11-on-ubuntu-18-04-16-04-lts-8c098c503c5f
2. Create go environment:
#!/bin/bash
# environment vars for golang
export GOROOT=/usr/local/go
export GOPATH=$HOME/go
export PATH=$GOPATH/bin:$GOROOT/bin:$PATH
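
To check that the Go environment is picked up (assuming the export lines above are in ~/.bashrc or have been pasted into the current shell):
  go version        # should report go1.11.x
  go env GOPATH     # should print the path set above, i.e. $HOME/go expanded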
3. The installation instructions in https://git.autistici.org/ale/crawl/README.md are not very clear and don't work as-is at this stage.

These steps work:

cd $GOPATH
mkdir bin
mkdir src
cd src

4. Since trying to "go install" the crawl URL directly didn't work (see https://stackoverflow.com/questions/14416275/error-cant-load-package-package-my-prog-found-packages-my-prog-and-main), clone the repository into the import path that go expects:

vagrant@node2:~/go/src$
  mkdir -p git.autistici.org/ale
  cd git.autistici.org/ale
  git clone https://git.autistici.org/ale/crawl.git

[Now can run the install command in README.md:]
  cd $GOPATH/src
  go install git.autistici.org/ale/crawl/cmd/crawl

Now we should have a $GOPATH/bin folder containing the "crawl" binary.
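
A quick check that the build worked:
  ls -l $GOPATH/bin
which should list the "crawl" binary.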

5. Run a crawl:
  cd $GOPATH/bin
  ./crawl https://www.cs.waikato.ac.nz/~davidb/

which downloads the site and puts the WARC file into the $GOPATH/bin folder.

More options, including the output folder and a WARC filename pattern (so that the multiple WARC files created for one huge site all follow the same pattern), are in the instructions in README.md.
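
In the meantime, one way to keep each site's output separate using plain shell only (no crawl-specific flags are assumed here, and $HOME/warcs is just an example location):
  mkdir -p $HOME/warcs/www.cs.waikato.ac.nz
  cd $GOPATH/bin
  ./crawl https://www.cs.waikato.ac.nz/~davidb/
  mv ./*.warc* $HOME/warcs/www.cs.waikato.ac.nz/   # move whatever WARC file(s) the crawl wrote into $GOPATH/bin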

6. To view the RAW contents of a WARC file:
https://github.com/ArchiveTeam/grab-site/blob/master/README.md#viewing-the-content-in-your-warc-archives

zless <warc-file-name>

zless is already installed on the vagrant VM.
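
The per-record WARC headers can also be pulled out on the command line; WARC-Type and WARC-Target-URI are standard WARC header fields, and <warc-file-name> is a placeholder as above:
  zcat <warc-file-name> | grep -a -E "^WARC-(Type|Target-URI):" | head -50
which lists what kind of record each one is and which URL it was captured from.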


Issues converting to WET:
The WARCs produced by Autistici's crawl are not in the correct format for the WARC-to-WET tools: elements are missing from the header and the ordering is different.
But WET is an official format, not CommonCrawl specific:

https://library.stanford.edu/projects/web-archiving/research-resources/data-formats-and-apis
WET (parsed text)

WARC Encapsulated Text (WET), or parsed text, consists of the extracted plaintext, delimited by archived document. Each record retains the associated URL and timestamp. Common Crawl provides details on the format and Internet Archive provides documentation on usage, though they use different names for the format.
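
Roughly, each captured page becomes one "conversion" record in the WET file; a sketch of the layout (the field values here are made up, but the header names are the standard WARC ones):
  WARC/1.0
  WARC-Type: conversion
  WARC-Target-URI: https://www.cs.waikato.ac.nz/~davidb/
  WARC-Date: 2019-10-01T09:00:00Z
  WARC-Record-ID: <urn:uuid:...>
  Content-Type: text/plain
  Content-Length: 1234

  (extracted plain text of the page)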

-----------

[... unchanged lines omitted ...]

[Added in r33541:]

Solution to get a working nutch2:
Get http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/hdfs-cc-work/vagrant-for-nutch2.tar.gz
and follow the instructions in my README file in there.

---------------------------------------------------------------------
ALTERNATIVES TO NUTCH - looking for site mirroring capabilities
---------------------------------------------------------------------
=> https://anarc.at/services/archive/web/
    Autistici's crawl [https://git.autistici.org/ale/crawl] needs Go:
    https://medium.com/better-programming/install-go-1-11-on-ubuntu-18-04-16-04-lts-8c098c503c5f
    https://guide.freecodecamp.org/go/installing-go/ubuntu-apt-get/
        To uninstall: https://medium.com/@firebitsbr/how-to-uninstall-from-the-apt-manager-uninstall-just-golang-go-from-universe-debian-ubuntu-82d6a3692cbd
    https://tecadmin.net/install-go-on-ubuntu/ [our vagrant VMs are Ubuntu 16.04 LTS, as discovered by running the cmd "lsb_release -a"]
https://alternativeto.net/software/apache-nutch/
https://alternativeto.net/software/wget/
https://github.com/ArchiveTeam/grab-site/blob/master/README.md#inspecting-warc-files-in-the-terminal
https://github.com/ArchiveTeam/wpull