Changeset 33541

Timestamp:
01.10.2019 22:27:03 (2 weeks ago)
Author:
ak19
Message:

1. hdfs-cc-work/GS_README.txt now contains the complete instructions for using Autistici's crawl to download a website (as a WARC file), as well as the instructions to convert those WARCs to WET. 2. Moved the first part out of MoreReading/crawling-Nutch.txt. 3. Added patched WARC-to-WET files for the git projects ia-web-commons and ia-hadoop-tools to successfully do the WARC-to-WET processing on WARC files generated by Autistici's crawl. (Worked on Dr Bainbridge's home page site as a test. Not tried any other site yet, as I wanted to get the workflow from crawl to WET working.)

Location:
gs3-extensions/maori-lang-detection
Files:
2 added
2 modified

  • gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt

    r33540 r33541  
    44https://www.quora.com/What-are-some-Web-crawler-tips-to-avoid-crawler-traps 
    55 
    6  
    7 ALTERNATIVES TO NUTCH - looking for site mirroring capabilities 
    8 https://anarc.at/services/archive/web/  
    9     Autistici's crawl [https://git.autistici.org/ale/crawl] needs Go: 
    10     https://medium.com/better-programming/install-go-1-11-on-ubuntu-18-04-16-04-lts-8c098c503c5f     
    11     https://guide.freecodecamp.org/go/installing-go/ubuntu-apt-get/ 
    12         https://medium.com/@firebitsbr/how-to-uninstall-from-the-apt-manager-uninstall-just-golang-go-from-universe-debian-ubuntu-82d6a3692cbd 
    13     https://tecadmin.net/install-go-on-ubuntu/ [our vagrant VMs are Ubuntu 16.04 LTS, as discovered by running the cmd "lsb_release -a"] 
    14 https://alternativeto.net/software/apache-nutch/ 
    15 https://alternativeto.net/software/wget/ 
    16 https://github.com/ArchiveTeam/grab-site/blob/master/README.md#inspecting-warc-files-in-the-terminal 
    17 https://github.com/ArchiveTeam/wpull 
    18  
    19 --- 
    20 Autistici crawl: 
    21 --- 
    22 1. Install go 1.11 by following instructions at https://medium.com/better-programming/install-go-1-11-on-ubuntu-18-04-16-04-lts-8c098c503c5f 
    23 2. Create go environment: 
    24 #!/bin/bash 
    25 # environment vars for golang                
    26 export GOROOT=/usr/local/go 
    27 export GOPATH=$HOME/go 
    28 export PATH=$GOPATH/bin:$GOROOT/bin:$PATH 
    29 3. The https://git.autistici.org/ale/crawl/README.md instructions on installing are not very clear and don't work as is at this stage. 
    30  
    31 These steps work: 
    32  
    33 cd $GOPATH 
    34 mkdir bin 
    35 mkdir src 
    36 cd src 
    37  
    38 4. Since running "go install" directly on the crawl URL didn't work (see below), clone the repository manually instead: 
    39 https://stackoverflow.com/questions/14416275/error-cant-load-package-package-my-prog-found-packages-my-prog-and-main 
    40  
    41 vagrant@node2:~/go/src$ 
    42   mkdir -p git.autistici.org/ale 
    43   cd git.autistici.org/ale 
    44   git clone https://git.autistici.org/ale/crawl.git 
    45  
    46 [Now can run the install command in README.md:] 
    47   cd $GOPATH/src 
    48   go install git.autistici.org/ale/crawl/cmd/crawl 
    49  
    50 Now we should have a $GOPATH/bin folder containing the "crawl" binary 
    51  
    52 5. Run a crawl: 
    53   cd $GOPATH/bin 
    54   ./crawl https://www.cs.waikato.ac.nz/~davidb/ 
    55  
    56 which downloads the site and puts the warc file into the $GOPATH/bin folder. 
    57  
    58 More options are described in README.md, including setting the output folder and a WARC filename pattern for huge sites (so that the multiple WARC files created for one site follow the same naming pattern). 
    59  
    60 6. To view the RAW contents of a WARC file: 
    61 https://github.com/ArchiveTeam/grab-site/blob/master/README.md#viewing-the-content-in-your-warc-archives 
    62  
    63 zless <warc-file-name> 
    64  
    65 zless is already installed on the vagrant VM 
    66  
    67  
    68 Issues converting to WET: 
    69 The WARCs are not in the expected format: elements are missing from the header, and the ordering differs. 
    70 But WET is an official format, not CommonCrawl specific: 
    71  
    72 https://library.stanford.edu/projects/web-archiving/research-resources/data-formats-and-apis 
    73 WET (parsed text) 
    74  
    75 WARC Encapsulated Text (WET) or parsed text consists of the extracted plaintext, delimited by archived document. Each record retains the associated URL and timestamp. Common Crawl provides details on the format and Internet Archive provides documentation on usage, though they use different names for the format. 
    766 
    777----------- 
     
    15383 
    15484 
     85Solution to get a working nutch2: 
     86Get http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/hdfs-cc-work/vagrant-for-nutch2.tar.gz 
     87and follow the instructions in the README file in there. 
     88 
     89--------------------------------------------------------------------- 
     90ALTERNATIVES TO NUTCH - looking for site mirroring capabilities 
     91--------------------------------------------------------------------- 
     92=> https://anarc.at/services/archive/web/  
     93    Autistici's crawl [https://git.autistici.org/ale/crawl] needs Go: 
     94    https://medium.com/better-programming/install-go-1-11-on-ubuntu-18-04-16-04-lts-8c098c503c5f     
     95    https://guide.freecodecamp.org/go/installing-go/ubuntu-apt-get/ 
     96        To uninstall: https://medium.com/@firebitsbr/how-to-uninstall-from-the-apt-manager-uninstall-just-golang-go-from-universe-debian-ubuntu-82d6a3692cbd 
     97    https://tecadmin.net/install-go-on-ubuntu/ [our vagrant VMs are Ubuntu 16.04 LTS, as discovered by running the cmd "lsb_release -a"] 
     98https://alternativeto.net/software/apache-nutch/ 
     99https://alternativeto.net/software/wget/ 
     100https://github.com/ArchiveTeam/grab-site/blob/master/README.md#inspecting-warc-files-in-the-terminal 
     101https://github.com/ArchiveTeam/wpull 
     102 
  • gs3-extensions/maori-lang-detection/hdfs-cc-work/GS_README.TXT

    r33539 r33541  
    1313F.  Setup warc-to-wet tools (git projects) 
    1414G.  Getting and running our scripts 
     15--- 
 16H. Autistici's crawl - CLI to download web sites as WARCs, with basic features to avoid crawler traps 
     17 
    1518---------------------------------------- 
    1619 
     
    392395Each of these output wet folders can then be processed in turn by CCWETProcessor.java from http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java 
    393396 
     397----------------------------------- 
 398H. Autistici crawl 
     399----------------------------------- 
 400Autistici's crawl: a CLI to download web sites as WARCs, with basic features to avoid crawler traps. 
     401 
 402Of the various site-mirroring tools available, Autistici's "crawl" seemed the most promising: 
     403https://anarc.at/services/archive/web/ 
     404 
 405- CLI. 
 406- Can download a website quite simply; flags are available for additional settings. 
 407- Coded to prevent common crawler traps. 
 408- Downloads the website as a WARC file. 
 409- The WARC-to-WET process now works on the WARC file it produced for the usual test site (Dr Bainbridge's home page). 
     410 
 411Go needs to be installed in order to build and run Autistici's crawl. 
 412Not a problem, because I can do it on the remote machine (which also hosts the hdfs), where I have sudo powers. 
     413 
     414INSTRUCTIONS 
     415 
     4161. Install go 1.11 by following instructions at https://medium.com/better-programming/install-go-1-11-on-ubuntu-18-04-16-04-lts-8c098c503c5f 
     4172. Create go environment: 
     418#!/bin/bash 
     419# environment vars for golang                
     420export GOROOT=/usr/local/go 
     421export GOPATH=$HOME/go 
     422export PATH=$GOPATH/bin:$GOROOT/bin:$PATH 
     4233. The https://git.autistici.org/ale/crawl/README.md instructions on installing are not very clear and don't work as is at this stage. 
     424 
     425These steps work: 
     426 
     427cd $GOPATH 
     428mkdir bin 
     429mkdir src 
     430cd src 
     431 
 4324. Since running "go install" directly on the crawl URL didn't work (see below), clone the repository manually instead: 
     433https://stackoverflow.com/questions/14416275/error-cant-load-package-package-my-prog-found-packages-my-prog-and-main 
     434 
     435vagrant@node2:~/go/src$ 
     436  mkdir -p git.autistici.org/ale 
     437  cd git.autistici.org/ale 
     438  git clone https://git.autistici.org/ale/crawl.git 
     439 
     440[Now can run the install command in README.md:] 
     441  cd $GOPATH/src 
     442  go install git.autistici.org/ale/crawl/cmd/crawl 
     443 
     444Now we should have a $GOPATH/bin folder containing the "crawl" binary 
     445 
     4465. Run a crawl: 
     447  cd $GOPATH/bin 
     448  ./crawl https://www.cs.waikato.ac.nz/~davidb/ 
     449 
     450which downloads the site and puts the warc file into the $GOPATH/bin folder. 
     451 
 452More options are described in README.md, including setting the output folder and a WARC filename pattern for huge sites (so that the multiple WARC files created for one site follow the same naming pattern). 
     453 
     4546. To view the RAW contents of a WARC file: 
     455https://github.com/ArchiveTeam/grab-site/blob/master/README.md#viewing-the-content-in-your-warc-archives 
     456 
     457zless <warc-file-name> 
     458 
 459zless is already installed on the vagrant VM 
     460 
     461 
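Besides zless, the record headers in a (gzipped) WARC can be listed with a few lines of stdlib-only Python. This is a quick sketch, not a full WARC parser; the filename in the usage comment is just an example:

```python
import gzip

def list_record_headers(warc_path, limit=5):
    """Return the header fields of the first few records in a *.warc.gz file.

    Works on the crawl output (e.g. crawl.warc.gz) and also on *.warc.wet.gz
    files, since WET records use the same header layout.
    """
    records = []
    current = None
    with gzip.open(warc_path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            line = line.rstrip("\r\n")
            if line.startswith("WARC/"):      # start of a record header block
                current = {}
            elif current is not None:
                if line == "":                # blank line ends the header block
                    records.append(current)
                    current = None
                    if len(records) >= limit:
                        break
                elif ": " in line:
                    key, value = line.split(": ", 1)
                    current[key] = value
    return records

# Example usage (path is illustrative):
# for rec in list_record_headers("crawl.warc.gz"):
#     print(rec.get("WARC-Type"), rec.get("WARC-Target-URI", ""))
```

gzip.open reads concatenated gzip members transparently, so this also copes with WARCs where each record is a separate gzip member.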
     462----------------------------------------------------------------------------------------------- 
 463How to run warc-to-wet conversion on sites downloaded as WARCs by Autistici's "crawl"  
     464----------------------------------------------------------------------------------------------- 
     465ISSUES CONVERTING WARC to WET: 
     466--- 
 467WARC files produced by Autistici crawl are in a somewhat different format from CommonCrawl WARCs. 
     468- missing elements in header 
     469- different header elements 
     470- ordering different (if that matters) 
     471 
     472But WET is an official format, not CommonCrawl specific, as indicated by 
     473 
     474https://library.stanford.edu/projects/web-archiving/research-resources/data-formats-and-apis 
     475"WET (parsed text) 
     476 
     477WARC Encapsulated Text (WET) or parsed text consists of the extracted plaintext, delimited by archived document. Each record retains the associated URL and timestamp. Common Crawl provides details on the format and Internet Archive provides documentation on usage, though they use different names for the format." 
     478 
 479So it must be possible to get the WARC-to-WET conversion used for CommonCrawl data to work on Autistici crawl's WARC files. 
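For reference, a WET file is itself a WARC whose records have type "conversion". A single record looks roughly as follows (the field values here are illustrative, not taken from an actual file):

```
WARC/1.0
WARC-Type: conversion
WARC-Target-URI: https://www.cs.waikato.ac.nz/~davidb/
WARC-Date: 2019-10-01T09:27:03Z
WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>
WARC-Refers-To: <urn:uuid:00000000-0000-0000-0000-000000000000>
Content-Type: text/plain
Content-Length: 52

(the extracted plaintext of the archived page goes here)
```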
     480 
     481 
     482RESOLUTION: 
     483--- 
 484I made changes to 2 Java source files in the 2 github projects ia-web-commons and ia-hadoop-tools, which we use for the WARC-to-WET processing of CommonCrawl data. These git projects (with modifications for CommonCrawl) are already on http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/hdfs-cc-work/gitprojects. 
     485 
     486The changed files are as follows: 
 4871. patches/WATExtractorOutput.java 
 488   put into ia-web-commons/src/main/java/org/archive/extract 
 489   after renaming the existing file to WATExtractorOutput.orig 
     490 
     491THEN RECOMPILE ia-web-commons with: 
     492   mvn install 
     493 
 4942. patches/GZRangeClient.java 
 495   put into ia-hadoop-tools/src/main/java/org/archive/server 
 496   after renaming the existing file to GZRangeClient.orig 
     497   
     498THEN RECOMPILE ia-hadoop-tools with: 
     499   mvn package 
     500 
     501Make sure to first compile ia-web-commons, then ia-hadoop-tools. 
     502 
     503 
     504The modifications made to the above 2 files are as follows: 
     505>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
     5061. ia-web-commons/src/main/java/org/archive/extract/WATExtractorOutput.java 
     507 
     508[diff src/main/java/org/archive/extract/WATExtractorOutput.orig src/main/java/org/archive/extract/WATExtractorOutput.java] 
     509 
     510162,163c162,163 
     511<           targetURI = extractOrIO(md, "Envelope.WARC-Header-Metadata.WARC-Filename"); 
     512<       } else { 
     513--- 
     514>           targetURI = extractOrIO(md, "Envelope.WARC-Header-Metadata.WARC-Warcinfo-ID"); 
     515>       } else { 
     516 
     517 
     5182. ia-hadoop-tools/src/main/java/org/archive/server/GZRangeClient.java 
     519 
     520[diff src/main/java/org/archive/server/GZRangeClient.orig src/main/java/org/archive/server/GZRangeClient.java] 
     521 
     52276,83c76,82 
     523<       "WARC/1.0\r\n" + 
     524<       "WARC-Type: warcinfo\r\n" + 
     525<       "WARC-Date: %s\r\n" + 
     526<       "WARC-Filename: %s\r\n" + 
     527<       "WARC-Record-ID: <urn:uuid:%s>\r\n" + 
     528<       "Content-Type: application/warc-fields\r\n" + 
     529<       "Content-Length: %d\r\n\r\n"; 
     530<  
     531--- 
     532>       "WARC/1.0\r\n" + 
     533>       "Content-Type: application/warc-fields\r\n" + 
     534>       "WARC-Type: warcinfo\r\n" + 
     535>       "WARC-Warcinfo-ID: <urn:uuid:%s>\r\n" + 
     536>       "Content-Length: %d\r\n\r\n" + 
     537>       "WARC-Record-ID: <urn:uuid:%s>\r\n" +        
     538>       "WARC-Date: %s\r\n"; 
     539115,119c114,119 
     540<   private static String DEFAULT_WARC_PATTERN = "software: %s Extractor\r\n" + 
     541<   "format: WARC File Format 1.0\r\n" + 
     542<   "conformsTo: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf\r\n" + 
     543<   "publisher: Internet Archive\r\n" + 
     544<   "created: %s\r\n\r\n"; 
     545--- 
     546>   private static String DEFAULT_WARC_PATTERN = "Software: crawl/1.0\r\n" + 
     547>   "Format: WARC File Format 1.0\r\n" + 
     548>       "Conformsto: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf\r\n\r\n"; 
     549>     // + 
     550>     //"publisher: Internet Archive\r\n" + 
     551>     //"created: %s\r\n\r\n"; 
     552<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< 
     553 
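It's easy to forget one of the two copies before recompiling. The following hypothetical stdlib-only helper (the function name and the "WARC-Warcinfo-ID" marker string, which both patches introduce per the diffs above, are assumptions for illustration) checks that a patched file is in place alongside its .orig backup:

```python
import os

def patch_applied(patched_path, marker):
    """Check that `patched_path` sits alongside a .orig backup and
    contains `marker`, a string introduced by the patch."""
    orig_path = os.path.splitext(patched_path)[0] + ".orig"
    if not os.path.exists(orig_path):
        return False          # existing file was never renamed to .orig
    with open(patched_path, encoding="utf-8") as f:
        return marker in f.read()

# Usage (paths relative to where the git projects are checked out):
# patch_applied("ia-web-commons/src/main/java/org/archive/extract/WATExtractorOutput.java",
#               "WARC-Warcinfo-ID")
# patch_applied("ia-hadoop-tools/src/main/java/org/archive/server/GZRangeClient.java",
#               "WARC-Warcinfo-ID")
```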
     554 
 5553. To run the WARC-to-WET conversion, the WARC needs to live on hdfs in a "warc" folder, with "wet" and "wat" folders alongside it at the same level. 
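That is, the layout on hdfs should look like this (folder names as used in the commands below):

```
/user/vagrant/warctest/
    warc/    <- input *.warc.gz files go here
    wet/     <- the conversion writes *.warc.wet.gz files here
    wat/     <- output folder for the WAT side of the conversion
```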
     556 
     557For example, assume that running Autistici's crawl generated $GOPATH/bin/crawl.warc.gz 
     558(default location and filename unless you pass flags to crawl CLI to control these) 
     559 
     560a. Ensure you get crawl.warc.gz onto the vagrant VM with the WARC to WET git projects installed, recompiled with the above modifications. 
     561 
     562b. Now, create the folder structure needed for warc-to-wet conversion: 
     563   hdfs dfs -mkdir /user/vagrant/warctest 
     564   hdfs dfs -mkdir /user/vagrant/warctest/warc 
     565   hdfs dfs -mkdir /user/vagrant/warctest/wet 
     566   hdfs dfs -mkdir /user/vagrant/warctest/wat 
     567 
 568c. Put crawl.warc.gz into the warc folder on hdfs: 
     569   hdfs dfs -put crawl.warc.gz /user/vagrant/warctest/warc/. 
     570 
     571d. Finally, time to run the actual warc-to-wet conversion from ia-hadoop-tools: 
     572   cd ia-hadoop-tools 
     573   WARC_FOLDER=/user/vagrant/warctest/warc 
     574   $HADOOP_MAPRED_HOME/bin/hadoop jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -strictMode -skipExisting batch-id-xyz $WARC_FOLDER/crawl*.warc.gz 
     575 
 576This is more meaningful when the WARC_FOLDER contains multiple *.warc.gz files, 
 577since the above uses map-reduce to generate the *.warc.wet.gz files in the output wet folder. 
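When checking the output for many sites, it helps to know which wet file to expect for each input. Based on the crawl.warc.gz -> crawl.warc.wet.gz example in these notes (the naming for other inputs is an assumption), a tiny stdlib-only helper:

```python
def expected_wet_name(warc_name):
    """Map an input WARC filename to the wet filename the conversion is
    expected to produce (crawl.warc.gz -> crawl.warc.wet.gz; the same
    pattern for other names is an assumption)."""
    if not warc_name.endswith(".warc.gz"):
        raise ValueError("expected a *.warc.gz filename: " + warc_name)
    return warc_name[: -len(".warc.gz")] + ".warc.wet.gz"

# expected_wet_name("crawl.warc.gz") -> "crawl.warc.wet.gz"
```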
     578 
     579e. Copy the generated wet files across from /user/vagrant/warctest/wet/: 
     580 
 581   cd /vagrant   (or else cd /home/vagrant) 
     584   hdfs dfs -get /user/vagrant/warctest/wet/crawl.warc.wet.gz . 
     585 
     586or, when dealing with multiple input warc files, we'll have multiple wet files: 
 587    hdfs dfs -get /user/vagrant/warctest/wet/*.warc.wet.gz . 
     588 
     589 
 590f. Now view the contents of the WET files to confirm they are what we want: 
     591   gunzip crawl.warc.wet.gz 
     592   zless crawl.warc.wet 
     593 
     594The wet file contents should look good now: the web pages as WET records without html tags. 
     595 
    394596 
    395597-----------------------EOF------------------------