Changeset 33541

Timestamp:
2019-10-01T22:27:03+13:00
Author:
ak19
Message:
  1. hdfs-cc-work/GS_README.TXT now contains the complete instructions for using Autistici crawl to download a website (as a WARC file), as well as the instructions to convert those WARCs to WET. 2. Moved the first part out of MoreReading/crawling-Nutch.txt. 3. Added patched WARC-to-WET files for the git projects ia-web-commons and ia-hadoop-tools, to successfully do the WARC-to-WET processing on WARC files generated by Autistici crawl. (Worked on Dr Bainbridge's home page site as a test; not tried any other site yet, as I wanted to get the workflow from crawl to WET working.)
Location:
gs3-extensions/maori-lang-detection
Files:
2 added
2 edited

  • gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt

    r33540 r33541  
https://www.quora.com/What-are-some-Web-crawler-tips-to-avoid-crawler-traps


ALTERNATIVES TO NUTCH - looking for site mirroring capabilities
https://anarc.at/services/archive/web/
    Autistici's crawl [https://git.autistici.org/ale/crawl] needs Go:
    https://medium.com/better-programming/install-go-1-11-on-ubuntu-18-04-16-04-lts-8c098c503c5f
    https://guide.freecodecamp.org/go/installing-go/ubuntu-apt-get/
        https://medium.com/@firebitsbr/how-to-uninstall-from-the-apt-manager-uninstall-just-golang-go-from-universe-debian-ubuntu-82d6a3692cbd
    https://tecadmin.net/install-go-on-ubuntu/ [our vagrant VMs are Ubuntu 16.04 LTS, as discovered by running the cmd "lsb_release -a"]
https://alternativeto.net/software/apache-nutch/
https://alternativeto.net/software/wget/
https://github.com/ArchiveTeam/grab-site/blob/master/README.md#inspecting-warc-files-in-the-terminal
https://github.com/ArchiveTeam/wpull

---
Autistici crawl:
---
1. Install go 1.11 by following the instructions at https://medium.com/better-programming/install-go-1-11-on-ubuntu-18-04-16-04-lts-8c098c503c5f
2. Create the go environment:
#!/bin/bash
# environment vars for golang
export GOROOT=/usr/local/go
export GOPATH=$HOME/go
export PATH=$GOPATH/bin:$GOROOT/bin:$PATH
3. The installation instructions in https://git.autistici.org/ale/crawl/README.md are not very clear and don't work as-is at this stage.

These steps work instead:

cd $GOPATH
mkdir bin
mkdir src
cd src

4. Since trying to "go install" the crawl URL directly didn't work (see
https://stackoverflow.com/questions/14416275/error-cant-load-package-package-my-prog-found-packages-my-prog-and-main),
clone the repository into the expected import path first:

vagrant@node2:~/go/src$
  mkdir -p git.autistici.org/ale
  cd git.autistici.org/ale
  git clone https://git.autistici.org/ale/crawl.git

[Now we can run the install command in README.md:]
  cd $GOPATH/src
  go install git.autistici.org/ale/crawl/cmd/crawl

Now we should have a $GOPATH/bin folder containing the "crawl" binary.

5. Run a crawl:
  cd $GOPATH/bin
  ./crawl https://www.cs.waikato.ac.nz/~davidb/

which downloads the site and puts the WARC file into the $GOPATH/bin folder.

More options are in the instructions in README.md, including the output folder and the WARC filename pattern for huge sites, so that the multiple WARC files created for one site follow the same pattern.

6. To view the RAW contents of a WARC file:
https://github.com/ArchiveTeam/grab-site/blob/master/README.md#viewing-the-content-in-your-warc-archives

zless <warc-file-name>

zless is already installed on the vagrant VM.


Issues converting to WET:
Not the correct WARC format: missing elements in the header, different ordering.
But WET is an official format, not CommonCrawl-specific:

https://library.stanford.edu/projects/web-archiving/research-resources/data-formats-and-apis
WET (parsed text)

WARC Encapsulated Text (WET) or parsed text consists of the extracted plaintext, delimited by archived document. Each record retains the associated URL and timestamp. Common Crawl provides details on the format and Internet Archive provides documentation on usage, though they use different names for the format.

-----------
     
…
Solution to get a working nutch2:
Get http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/hdfs-cc-work/vagrant-for-nutch2.tar.gz
and follow the instructions in my README file in there.
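A plausible way to fetch and unpack that archive from the command line (a sketch; the "?format=raw" suffix is Trac's usual raw-download convention, assumed to be enabled on this Trac instance):
   wget -O vagrant-for-nutch2.tar.gz "http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/hdfs-cc-work/vagrant-for-nutch2.tar.gz?format=raw"
   tar xvzf vagrant-for-nutch2.tar.gz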

---------------------------------------------------------------------
ALTERNATIVES TO NUTCH - looking for site mirroring capabilities
---------------------------------------------------------------------
=> https://anarc.at/services/archive/web/
    Autistici's crawl [https://git.autistici.org/ale/crawl] needs Go:
    https://medium.com/better-programming/install-go-1-11-on-ubuntu-18-04-16-04-lts-8c098c503c5f
    https://guide.freecodecamp.org/go/installing-go/ubuntu-apt-get/
        To uninstall: https://medium.com/@firebitsbr/how-to-uninstall-from-the-apt-manager-uninstall-just-golang-go-from-universe-debian-ubuntu-82d6a3692cbd
    https://tecadmin.net/install-go-on-ubuntu/ [our vagrant VMs are Ubuntu 16.04 LTS, as discovered by running the cmd "lsb_release -a"]
https://alternativeto.net/software/apache-nutch/
https://alternativeto.net/software/wget/
https://github.com/ArchiveTeam/grab-site/blob/master/README.md#inspecting-warc-files-in-the-terminal
https://github.com/ArchiveTeam/wpull
  • gs3-extensions/maori-lang-detection/hdfs-cc-work/GS_README.TXT

    r33539 r33541  
F.  Setup warc-to-wet tools (git projects)
G.  Getting and running our scripts
---
H.  Autistici's crawl - CLI to download web sites as WARCs; features basics to avoid crawler traps

----------------------------------------
     
Each of these output wet folders can then be processed in turn by CCWETProcessor.java from http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java
-----------------------------------
H. Autistici crawl
-----------------------------------
Autistici's crawl: CLI to download web sites as WARCs; features basics to avoid crawler traps.

Of the several software options for site mirroring, Autistici's "crawl" seemed promising:
https://anarc.at/services/archive/web/

- CLI.
- Can download a website quite simply, though flags for additional settings are available.
- Coded to prevent common crawler traps.
- Downloads a website as a WARC file.
- I now have the WARC-to-WET process working for the WARC file it produced for the usual test site (Dr Bainbridge's home page).

Go needs to be installed in order to install and run Autistici's crawl.
Not a problem, because I can do it on the remote machine (which also hosts the hdfs), where I have sudo powers.
INSTRUCTIONS

1. Install go 1.11 by following the instructions at https://medium.com/better-programming/install-go-1-11-on-ubuntu-18-04-16-04-lts-8c098c503c5f
2. Create the go environment:
#!/bin/bash
# environment vars for golang
export GOROOT=/usr/local/go
export GOPATH=$HOME/go
export PATH=$GOPATH/bin:$GOROOT/bin:$PATH
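A minimal way to apply and verify this environment (a sketch; the filename ~/go_env.sh is hypothetical, just somewhere to save the script above):
   source ~/go_env.sh
   go version      # should report go1.11 if step 1 worked
   echo $GOPATH    # should print /home/<your-user>/go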
3. The installation instructions in https://git.autistici.org/ale/crawl/README.md are not very clear and don't work as-is at this stage.

These steps work instead:

cd $GOPATH
mkdir bin
mkdir src
cd src

4. Since trying to "go install" the crawl URL directly didn't work (see
https://stackoverflow.com/questions/14416275/error-cant-load-package-package-my-prog-found-packages-my-prog-and-main),
clone the repository into the expected import path first:

vagrant@node2:~/go/src$
  mkdir -p git.autistici.org/ale
  cd git.autistici.org/ale
  git clone https://git.autistici.org/ale/crawl.git

[Now we can run the install command in README.md:]
  cd $GOPATH/src
  go install git.autistici.org/ale/crawl/cmd/crawl

Now we should have a $GOPATH/bin folder containing the "crawl" binary.
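A quick check that the build landed where expected (nothing here beyond commands already used above):
   ls -l $GOPATH/bin/crawl
   command -v crawl    # resolves via PATH, since step 2 added $GOPATH/bin to PATH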

5. Run a crawl:
  cd $GOPATH/bin
  ./crawl https://www.cs.waikato.ac.nz/~davidb/

which downloads the site and puts the WARC file into the $GOPATH/bin folder.

More options are in the instructions in README.md, including the output folder and the WARC filename pattern for huge sites, so that the multiple WARC files created for one site follow the same pattern.
     453
     4546. To view the RAW contents of a WARC file:
     455https://github.com/ArchiveTeam/grab-site/blob/master/README.md#viewing-the-content-in-your-warc-archives
     456
     457zless <warc-file-name>
     458
     459zless already installed on vagrant file
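To pull out just the captured URLs from the terminal, standard tools suffice (a sketch; WARC-Target-URI is the standard WARC header that names each captured URL):
   # list every URL recorded in the WARC
   zcat crawl.warc.gz | grep --text "WARC-Target-URI"
The --text flag forces grep to treat the decompressed stream as text, since WARC payloads can contain binary data.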


-----------------------------------------------------------------------------------------------
How to run warc-to-wet conversion on sites downloaded as WARCs by Autistici's "crawl"
-----------------------------------------------------------------------------------------------
ISSUES CONVERTING WARC to WET:
---
WARC files produced by Autistici crawl are of a somewhat different format to CommonCrawl WARCs:
- missing elements in the header
- different header elements
- different ordering (if that matters)

But WET is an official format, not CommonCrawl-specific, as indicated by

https://library.stanford.edu/projects/web-archiving/research-resources/data-formats-and-apis
"WET (parsed text)

WARC Encapsulated Text (WET) or parsed text consists of the extracted plaintext, delimited by archived document. Each record retains the associated URL and timestamp. Common Crawl provides details on the format and Internet Archive provides documentation on usage, though they use different names for the format."

So it must be possible to get the WARC-to-WET conversion used for CommonCrawl data to work on Autistici crawl's WARC files.
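For reference, a WET record is essentially a WARC "conversion" record. Going by CommonCrawl's published WET output, a single record looks roughly like the following (all field values here are illustrative, not taken from an actual file):
   WARC/1.0
   WARC-Type: conversion
   WARC-Target-URI: https://www.cs.waikato.ac.nz/~davidb/
   WARC-Date: 2019-10-01T09:27:03Z
   WARC-Record-ID: <urn:uuid:...>
   WARC-Refers-To: <urn:uuid:...>
   Content-Type: text/plain
   Content-Length: 1234

   ...extracted plaintext of the page...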


RESOLUTION:
---
I made changes to 2 java source files in the 2 github projects ia-web-commons and ia-hadoop-tools, which we use for the WARC-to-WET processing of CommonCrawl data. These git projects (with modifications for commoncrawl) are already on http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/hdfs-cc-work/gitprojects.

The changed files are as follows:
1. patches/WATExtractorOutput.java
   put into ia-web-commons/src/main/java/org/archive/extract
   after renaming the existing file to .orig

THEN RECOMPILE ia-web-commons with:
   mvn install

2. patches/GZRangeClient.java
   put into ia-hadoop-tools/src/main/java/org/archive/server
   after renaming the existing file to .orig

THEN RECOMPILE ia-hadoop-tools with:
   mvn package

Make sure to first compile ia-web-commons, then ia-hadoop-tools.
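Put together, the patch-and-rebuild sequence might look like this (a sketch; it assumes the patches folder and the two git projects sit side by side in the current directory):
   # 1. patch ia-web-commons, then install it into the local maven repo
   mv ia-web-commons/src/main/java/org/archive/extract/WATExtractorOutput.java \
      ia-web-commons/src/main/java/org/archive/extract/WATExtractorOutput.orig
   cp patches/WATExtractorOutput.java ia-web-commons/src/main/java/org/archive/extract/
   (cd ia-web-commons && mvn install)

   # 2. patch ia-hadoop-tools, then build its jar (depends on the ia-web-commons install above)
   mv ia-hadoop-tools/src/main/java/org/archive/server/GZRangeClient.java \
      ia-hadoop-tools/src/main/java/org/archive/server/GZRangeClient.orig
   cp patches/GZRangeClient.java ia-hadoop-tools/src/main/java/org/archive/server/
   (cd ia-hadoop-tools && mvn package)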


The modifications made to the above 2 files are as follows:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
1. ia-web-commons/src/main/java/org/archive/extract/WATExtractorOutput.java

[diff src/main/java/org/archive/extract/WATExtractorOutput.orig src/main/java/org/archive/extract/WATExtractorOutput.java]
162,163c162,163
<           targetURI = extractOrIO(md, "Envelope.WARC-Header-Metadata.WARC-Filename");
<       } else {
---
>           targetURI = extractOrIO(md, "Envelope.WARC-Header-Metadata.WARC-Warcinfo-ID");
>       } else {


2. ia-hadoop-tools/src/main/java/org/archive/server/GZRangeClient.java

[diff src/main/java/org/archive/server/GZRangeClient.orig src/main/java/org/archive/server/GZRangeClient.java]

76,83c76,82
<       "WARC/1.0\r\n" +
<       "WARC-Type: warcinfo\r\n" +
<       "WARC-Date: %s\r\n" +
<       "WARC-Filename: %s\r\n" +
<       "WARC-Record-ID: <urn:uuid:%s>\r\n" +
<       "Content-Type: application/warc-fields\r\n" +
<       "Content-Length: %d\r\n\r\n";
<
---
>       "WARC/1.0\r\n" +
>       "Content-Type: application/warc-fields\r\n" +
>       "WARC-Type: warcinfo\r\n" +
>       "WARC-Warcinfo-ID: <urn:uuid:%s>\r\n" +
>       "Content-Length: %d\r\n\r\n" +
>       "WARC-Record-ID: <urn:uuid:%s>\r\n" +
>       "WARC-Date: %s\r\n";
115,119c114,119
<   private static String DEFAULT_WARC_PATTERN = "software: %s Extractor\r\n" +
<   "format: WARC File Format 1.0\r\n" +
<   "conformsTo: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf\r\n" +
<   "publisher: Internet Archive\r\n" +
<   "created: %s\r\n\r\n";
---
>   private static String DEFAULT_WARC_PATTERN = "Software: crawl/1.0\r\n" +
>   "Format: WARC File Format 1.0\r\n" +
>       "Conformsto: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf\r\n\r\n";
>     // +
>     //"publisher: Internet Archive\r\n" +
>     //"created: %s\r\n\r\n";
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<


3. To run WARC-to-WET, the WARC needs to live on HDFS in a "warc" folder, and there should be "wet" and "wat" folders at the same level.

For example, assume that running Autistici's crawl generated $GOPATH/bin/crawl.warc.gz
(the default location and filename, unless you pass flags to the crawl CLI to control these).

a. Ensure you get crawl.warc.gz onto the vagrant VM with the WARC-to-WET git projects installed, recompiled with the above modifications.

b. Now, create the folder structure needed for warc-to-wet conversion:
   hdfs dfs -mkdir /user/vagrant/warctest
   hdfs dfs -mkdir /user/vagrant/warctest/warc
   hdfs dfs -mkdir /user/vagrant/warctest/wet
   hdfs dfs -mkdir /user/vagrant/warctest/wat
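(Equivalently, hdfs dfs -mkdir accepts -p to create parent folders in one go, mirroring the usual mkdir -p:
   hdfs dfs -mkdir -p /user/vagrant/warctest/{warc,wet,wat}
though note the brace expansion happens in the local shell, so this assumes a bash-like shell.)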

c. Put crawl.warc.gz into the warc folder on hdfs:
   hdfs dfs -put crawl.warc.gz /user/vagrant/warctest/warc/.

d. Finally, time to run the actual warc-to-wet conversion from ia-hadoop-tools:
   cd ia-hadoop-tools
   WARC_FOLDER=/user/vagrant/warctest/warc
   $HADOOP_MAPRED_HOME/bin/hadoop jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -strictMode -skipExisting batch-id-xyz $WARC_FOLDER/crawl*.warc.gz

This is more meaningful when the WARC_FOLDER contains multiple *.warc.gz files,
as the above will use map-reduce to generate the *.warc.wet.gz files in the output wet folder.
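For repeated runs, steps b-d can be rolled into one small script (a sketch; the local warcs/ folder and the batch id are just the examples used above):
   #!/bin/bash
   # one-shot warc-to-wet run over everything in a local warcs/ folder
   HDFS_BASE=/user/vagrant/warctest
   hdfs dfs -mkdir -p $HDFS_BASE/warc $HDFS_BASE/wet $HDFS_BASE/wat
   hdfs dfs -put warcs/*.warc.gz $HDFS_BASE/warc/.
   cd ia-hadoop-tools
   # quote the glob so it reaches the tool unexpanded and is matched against HDFS paths
   $HADOOP_MAPRED_HOME/bin/hadoop jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar \
       WEATGenerator -strictMode -skipExisting batch-id-xyz "$HDFS_BASE/warc/*.warc.gz"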

e. Copy the generated wet files across from /user/vagrant/warctest/wet/:

   (cd /vagrant, or else
   cd /home/vagrant
   )
   hdfs dfs -get /user/vagrant/warctest/wet/crawl.warc.wet.gz .

or, when dealing with multiple input WARC files, we'll have multiple wet files:
    hdfs dfs -get /user/vagrant/warctest/wet/*.warc.wet.gz .


f. Now we can view the contents of the WET files to confirm they are what we want:
   gunzip crawl.warc.wet.gz
   zless crawl.warc.wet

The wet file contents should look good now: the web pages as WET records, without html tags.
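A quick sanity check without unzipping first (a sketch; it assumes the patched converter emits standard "conversion" records, one per captured page):
   zcat crawl.warc.wet.gz | grep --text -c "WARC-Type: conversion"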


-----------------------EOF------------------------