Timestamp:
2019-10-01T22:27:03+13:00
Author:
ak19
Message:
  1. hdfs-cc-work/GS_README.txt now contains the complete instructions for using Autistici crawl to download a website (as a WARC file), as well as the instructions for converting those WARCs to WET. 2. Moved the first part out of MoreReading/crawling-Nutch.txt. 3. Added patched WARC-to-WET files for the git projects ia-web-commons and ia-hadoop-tools so that WARC-to-WET processing succeeds on WARC files generated by Autistici crawl. (Worked on Dr Bainbridge's home page site as a test. Not tried on any other site yet, as I wanted to get the workflow from crawl to WET working first.)
File:
1 edited

  • gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt

r33540 → r33541
https://www.quora.com/What-are-some-Web-crawler-tips-to-avoid-crawler-traps

[Removed in r33541 (this material was moved to hdfs-cc-work/GS_README.txt):]

ALTERNATIVES TO NUTCH - looking for site mirroring capabilities
https://anarc.at/services/archive/web/
    Autistici's crawl [https://git.autistici.org/ale/crawl] needs Go:
    https://medium.com/better-programming/install-go-1-11-on-ubuntu-18-04-16-04-lts-8c098c503c5f
    https://guide.freecodecamp.org/go/installing-go/ubuntu-apt-get/
        https://medium.com/@firebitsbr/how-to-uninstall-from-the-apt-manager-uninstall-just-golang-go-from-universe-debian-ubuntu-82d6a3692cbd
    https://tecadmin.net/install-go-on-ubuntu/ [our vagrant VMs are Ubuntu 16.04 LTS, as discovered by running the cmd "lsb_release -a"]
https://alternativeto.net/software/apache-nutch/
https://alternativeto.net/software/wget/
https://github.com/ArchiveTeam/grab-site/blob/master/README.md#inspecting-warc-files-in-the-terminal
https://github.com/ArchiveTeam/wpull

---
Autistici crawl:
---
1. Install go 1.11 by following instructions at https://medium.com/better-programming/install-go-1-11-on-ubuntu-18-04-16-04-lts-8c098c503c5f
2. Create go environment:
#!/bin/bash
# environment vars for golang
export GOROOT=/usr/local/go
export GOPATH=$HOME/go
export PATH=$GOPATH/bin:$GOROOT/bin:$PATH
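
To check that the Go environment is picked up (assuming the export lines above are in ~/.bashrc or have been pasted into the current shell):
  go version        # should report go1.11.x
  go env GOPATH     # should print the path set above, i.e. $HOME/go expanded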
3. The installation instructions in https://git.autistici.org/ale/crawl/README.md are not very clear and don't work as-is at this stage.

These steps work:

cd $GOPATH
mkdir bin
mkdir src
cd src

4. Since trying to "go install" the crawl URL directly didn't work (see https://stackoverflow.com/questions/14416275/error-cant-load-package-package-my-prog-found-packages-my-prog-and-main), clone the repository into the import path that go expects:

vagrant@node2:~/go/src$
  mkdir -p git.autistici.org/ale
  cd git.autistici.org/ale
  git clone https://git.autistici.org/ale/crawl.git

[Now can run the install command in README.md:]
  cd $GOPATH/src
  go install git.autistici.org/ale/crawl/cmd/crawl

Now we should have a $GOPATH/bin folder containing the "crawl" binary.
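
A quick check that the build worked:
  ls -l $GOPATH/bin
which should list the "crawl" binary.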

5. Run a crawl:
  cd $GOPATH/bin
  ./crawl https://www.cs.waikato.ac.nz/~davidb/

which downloads the site and puts the WARC file into the $GOPATH/bin folder.

More options, including the output folder and a WARC filename pattern (so that the multiple WARC files created for one huge site all follow the same pattern), are in the instructions in README.md.
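
In the meantime, one way to keep each site's output separate using plain shell only (no crawl-specific flags are assumed here, and $HOME/warcs is just an example location):
  mkdir -p $HOME/warcs/www.cs.waikato.ac.nz
  cd $GOPATH/bin
  ./crawl https://www.cs.waikato.ac.nz/~davidb/
  mv ./*.warc* $HOME/warcs/www.cs.waikato.ac.nz/   # move whatever WARC file(s) the crawl wrote into $GOPATH/bin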

6. To view the RAW contents of a WARC file:
https://github.com/ArchiveTeam/grab-site/blob/master/README.md#viewing-the-content-in-your-warc-archives

zless <warc-file-name>

zless is already installed on the vagrant VM.
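
The per-record WARC headers can also be pulled out on the command line; WARC-Type and WARC-Target-URI are standard WARC header fields, and <warc-file-name> is a placeholder as above:
  zcat <warc-file-name> | grep -a -E "^WARC-(Type|Target-URI):" | head -50
which lists what kind of record each one is and which URL it was captured from.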


Issues converting to WET:
The WARCs produced by Autistici's crawl are not in the correct format for the WARC-to-WET tools: elements are missing from the header and the ordering is different.
But WET is an official format, not CommonCrawl specific:

https://library.stanford.edu/projects/web-archiving/research-resources/data-formats-and-apis
WET (parsed text)

WARC Encapsulated Text (WET), or parsed text, consists of the extracted plaintext, delimited by archived document. Each record retains the associated URL and timestamp. Common Crawl provides details on the format and Internet Archive provides documentation on usage, though they use different names for the format.
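
Roughly, each captured page becomes one "conversion" record in the WET file; a sketch of the layout (the field values here are made up, but the header names are the standard WARC ones):
  WARC/1.0
  WARC-Type: conversion
  WARC-Target-URI: https://www.cs.waikato.ac.nz/~davidb/
  WARC-Date: 2019-10-01T09:00:00Z
  WARC-Record-ID: <urn:uuid:...>
  Content-Type: text/plain
  Content-Length: 1234

  (extracted plain text of the page)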

-----------

[... unchanged lines omitted ...]

[Added in r33541:]

Solution to get a working nutch2:
Get http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/hdfs-cc-work/vagrant-for-nutch2.tar.gz
and follow the instructions in my README file in there.

---------------------------------------------------------------------
ALTERNATIVES TO NUTCH - looking for site mirroring capabilities
---------------------------------------------------------------------
=> https://anarc.at/services/archive/web/
    Autistici's crawl [https://git.autistici.org/ale/crawl] needs Go:
    https://medium.com/better-programming/install-go-1-11-on-ubuntu-18-04-16-04-lts-8c098c503c5f
    https://guide.freecodecamp.org/go/installing-go/ubuntu-apt-get/
        To uninstall: https://medium.com/@firebitsbr/how-to-uninstall-from-the-apt-manager-uninstall-just-golang-go-from-universe-debian-ubuntu-82d6a3692cbd
    https://tecadmin.net/install-go-on-ubuntu/ [our vagrant VMs are Ubuntu 16.04 LTS, as discovered by running the cmd "lsb_release -a"]
https://alternativeto.net/software/apache-nutch/
https://alternativeto.net/software/wget/
https://github.com/ArchiveTeam/grab-site/blob/master/README.md#inspecting-warc-files-in-the-terminal
https://github.com/ArchiveTeam/wpull