Changeset 33541
- Timestamp: 2019-10-01T22:27:03+13:00
- Location: gs3-extensions/maori-lang-detection
- Files: 2 added, 2 edited
gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt
(r33540 → r33541)

https://www.quora.com/What-are-some-Web-crawler-tips-to-avoid-crawler-traps

Removed: the "ALTERNATIVES TO NUTCH - looking for site mirroring capabilities" link list, the step-by-step instructions for installing Go and Autistici's crawl and running a crawl, and the notes on the issues converting its WARC output to WET. All of this was moved into hdfs-cc-work/GS_README.TXT (the second file in this changeset), where the full text now appears as section H.

Added:

Solution to get a working nutch2:
get http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/hdfs-cc-work/vagrant-for-nutch2.tar.gz
and follow the instructions in my README file in there.
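The nutch2 fix above amounts to fetching the tarball and following its README. A minimal sketch of the fetch step, assuming Trac's standard "?format=raw" download behaviour applies to this server (the URL is the one from the notes; verify the raw link in a browser first):

```shell
#!/bin/bash
# Build the raw-download URL for the bundle referenced in the notes.
# "?format=raw" is Trac's usual way of serving the file itself rather
# than the repository-browser page (assumed to hold for this server).
TRAC_BASE=http://trac.greenstone.org/browser
BUNDLE_PATH=gs3-extensions/maori-lang-detection/hdfs-cc-work/vagrant-for-nutch2.tar.gz
BUNDLE_URL="$TRAC_BASE/$BUNDLE_PATH?format=raw"
echo "$BUNDLE_URL"

# On a machine with network access, the fetch-and-unpack would then be:
#   wget -O vagrant-for-nutch2.tar.gz "$BUNDLE_URL"
#   tar -xzf vagrant-for-nutch2.tar.gz   # then follow the README inside
```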
---------------------------------------------------------------------
ALTERNATIVES TO NUTCH - looking for site mirroring capabilities
---------------------------------------------------------------------
=> https://anarc.at/services/archive/web/
Autistici's crawl [https://git.autistici.org/ale/crawl] needs Go:
    https://medium.com/better-programming/install-go-1-11-on-ubuntu-18-04-16-04-lts-8c098c503c5f
    https://guide.freecodecamp.org/go/installing-go/ubuntu-apt-get/
    To uninstall: https://medium.com/@firebitsbr/how-to-uninstall-from-the-apt-manager-uninstall-just-golang-go-from-universe-debian-ubuntu-82d6a3692cbd
    https://tecadmin.net/install-go-on-ubuntu/ [our vagrant VMs are Ubuntu 16.04 LTS, as discovered by running the cmd "lsb_release -a"]
https://alternativeto.net/software/apache-nutch/
https://alternativeto.net/software/wget/
https://github.com/ArchiveTeam/grab-site/blob/master/README.md#inspecting-warc-files-in-the-terminal
https://github.com/ArchiveTeam/wpull
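The Go links above feed into the crawl install steps detailed in GS_README.TXT. The resulting environment can be sketched as one bash snippet (paths as used throughout these notes; assumes Go was unpacked under /usr/local/go):

```shell
#!/bin/bash
# environment vars for golang, as set in these notes
# (assumes Go 1.11 was installed under /usr/local/go)
export GOROOT=/usr/local/go
export GOPATH=$HOME/go
export PATH=$GOPATH/bin:$GOROOT/bin:$PATH

# the crawl build steps expect bin/ and src/ folders under $GOPATH
mkdir -p "$GOPATH/bin" "$GOPATH/src"
```

Putting the exports in ~/.bashrc (or a sourced script) keeps them across vagrant sessions.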
gs3-extensions/maori-lang-detection/hdfs-cc-work/GS_README.TXT
(r33539 → r33541)

F. Setup warc-to-wet tools (git projects)
G. Getting and running our scripts
---
H. Autistici's crawl - CLI to download web sites as WARCs, features basics to avoid crawler traps

----------------------------------------

…

Each of these output wet folders can then be processed in turn by CCWETProcessor.java from http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java

-----------------------------------
H. Autistici crawl
-----------------------------------
Autistici's crawl: a CLI to download web sites as WARCs, with basic features to avoid crawler traps.

Out of the several pieces of site-mirroring software, Autistici's "crawl" seemed promising:
https://anarc.at/services/archive/web/

- CLI.
- Can download a website quite simply, though flags for additional settings are available.
- Coded to prevent common crawler traps.
- Downloads a website as a WARC file.
- The WARC to WET process now works for the WARC file it produced for the usual test site (Dr Bainbridge's home page).

Go needs to be installed in order to install and run Autistici's crawl. Not a problem, because it can be done on the remote machine (which also hosts the hdfs), where I have sudo powers.

INSTRUCTIONS

1. Install go 1.11 by following the instructions at https://medium.com/better-programming/install-go-1-11-on-ubuntu-18-04-16-04-lts-8c098c503c5f

2. Create the go environment:
	#!/bin/bash
	# environment vars for golang
	export GOROOT=/usr/local/go
	export GOPATH=$HOME/go
	export PATH=$GOPATH/bin:$GOROOT/bin:$PATH

3. The installation instructions in https://git.autistici.org/ale/crawl/README.md are not very clear and don't work as-is at this stage. These steps work:
	cd $GOPATH
	mkdir bin
	mkdir src
	cd src

4. 
Since trying to "go install" the crawl URL directly didn't work
   (https://stackoverflow.com/questions/14416275/error-cant-load-package-package-my-prog-found-packages-my-prog-and-main),
   clone the repository manually:

	vagrant@node2:~/go/src$
	mkdir -p git.autistici.org/ale
	cd git.autistici.org/ale
	git clone https://git.autistici.org/ale/crawl.git

   [Now the install command in README.md can be run:]
	cd $GOPATH/src
	go install git.autistici.org/ale/crawl/cmd/crawl

   There should now be a $GOPATH/bin folder containing the "crawl" binary.

5. Run a crawl:
	cd $GOPATH/bin
	./crawl https://www.cs.waikato.ac.nz/~davidb/

   which downloads the site and puts the WARC file into the $GOPATH/bin folder.

   More options (including the output folder, and a WARC filename pattern so that the multiple WARC files created for one huge site follow the same pattern) are in the instructions in README.md.

6. To view the RAW contents of a WARC file:
   https://github.com/ArchiveTeam/grab-site/blob/master/README.md#viewing-the-content-in-your-warc-archives

	zless <warc-file-name>

   (zless is already installed on the vagrant VM.)


-----------------------------------------------------------------------------------------------
How to run warc-to-wet conversion on sites downloaded as WARCs by Autistici's "crawl"
-----------------------------------------------------------------------------------------------
ISSUES CONVERTING WARC TO WET:
---
WARC files produced by Autistici crawl are of a somewhat different format to CommonCrawl WARCs:
- missing elements in the header
- different header elements
- different ordering (if that matters)

But WET is an official format, not CommonCrawl-specific, as indicated by
https://library.stanford.edu/projects/web-archiving/research-resources/data-formats-and-apis:

"WET (parsed text)

WARC Encapsulated Text (WET) or parsed text consists of the extracted plaintext, delimited by archived document. Each record retains the associated URL and timestamp. Common Crawl provides details on the format and Internet Archive provides documentation on usage, though they use different names for the format."

So it must be possible to get the WARC to WET conversion used for CommonCrawl data to work on Autistici crawl's WARC files.


RESOLUTION:
---
I made changes to 2 java source files in the 2 github projects ia-web-commons and ia-hadoop-tools, which we use for the WARC to WET processing of CommonCrawl data. These git projects (with modifications for CommonCrawl) are already on http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/hdfs-cc-work/gitprojects.

The changed files are as follows:

1. patches/WATExtractorOutput.java
   Put into ia-web-commons/src/main/java/org/archive/extract after renaming the existing file to .orig.
   THEN RECOMPILE ia-web-commons with:
	mvn install

2. patches/GZRangeClient.java
   Put into ia-hadoop-tools/src/main/java/org/archive/server after renaming the existing file to .orig.
   THEN RECOMPILE ia-hadoop-tools with:
	mvn package

Make sure to compile ia-web-commons first, then ia-hadoop-tools.


The modifications made to the above 2 files are as follows:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
1. 
ia-web-commons/src/main/java/org/archive/extract/WATExtractorOutput.java

[diff src/main/java/org/archive/extract/WATExtractorOutput.orig src/main/java/org/archive/extract/WATExtractorOutput.java]

162,163c162,163
<     targetURI = extractOrIO(md, "Envelope.WARC-Header-Metadata.WARC-Filename");
<   } else {
---
>     targetURI = extractOrIO(md, "Envelope.WARC-Header-Metadata.WARC-Warcinfo-ID");
>   } else {


2. ia-hadoop-tools/src/main/java/org/archive/server/GZRangeClient.java

[diff src/main/java/org/archive/server/GZRangeClient.orig src/main/java/org/archive/server/GZRangeClient.java]

76,83c76,82
<   "WARC/1.0\r\n" +
<   "WARC-Type: warcinfo\r\n" +
<   "WARC-Date: %s\r\n" +
<   "WARC-Filename: %s\r\n" +
<   "WARC-Record-ID: <urn:uuid:%s>\r\n" +
<   "Content-Type: application/warc-fields\r\n" +
<   "Content-Length: %d\r\n\r\n";
<
---
>   "WARC/1.0\r\n" +
>   "Content-Type: application/warc-fields\r\n" +
>   "WARC-Type: warcinfo\r\n" +
>   "WARC-Warcinfo-ID: <urn:uuid:%s>\r\n" +
>   "Content-Length: %d\r\n\r\n" +
>   "WARC-Record-ID: <urn:uuid:%s>\r\n" +
>   "WARC-Date: %s\r\n";
115,119c114,119
<   private static String DEFAULT_WARC_PATTERN = "software: %s Extractor\r\n" +
<   "format: WARC File Format 1.0\r\n" +
<   "conformsTo: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf\r\n" +
<   "publisher: Internet Archive\r\n" +
<   "created: %s\r\n\r\n";
---
>   private static String DEFAULT_WARC_PATTERN = "Software: crawl/1.0\r\n" +
>   "Format: WARC File Format 1.0\r\n" +
>   "Conformsto: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf\r\n\r\n";
>   // +
>   //"publisher: Internet Archive\r\n" +
>   //"created: %s\r\n\r\n";
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<


3. 
To run WARC to WET, the warc needs to live on hdfs in a warc folder, and there should be wet and wat folders at the same level.

For example, assume that running Autistici's crawl generated $GOPATH/bin/crawl.warc.gz
(the default location and filename, unless you pass flags to the crawl CLI to control these).

a. Get crawl.warc.gz onto the vagrant VM that has the WARC to WET git projects installed, recompiled with the above modifications.

b. Create the folder structure needed for warc-to-wet conversion:
	hdfs dfs -mkdir /user/vagrant/warctest
	hdfs dfs -mkdir /user/vagrant/warctest/warc
	hdfs dfs -mkdir /user/vagrant/warctest/wet
	hdfs dfs -mkdir /user/vagrant/warctest/wat

c. Put crawl.warc.gz into the warc folder on hdfs:
	hdfs dfs -put crawl.warc.gz /user/vagrant/warctest/warc/.

d. Finally, run the actual warc-to-wet conversion from ia-hadoop-tools:
	cd ia-hadoop-tools
	WARC_FOLDER=/user/vagrant/warctest/warc
	$HADOOP_MAPRED_HOME/bin/hadoop jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -strictMode -skipExisting batch-id-xyz $WARC_FOLDER/crawl*.warc.gz

   This is more meaningful when the WARC_FOLDER contains multiple *.warc.gz files, as the above will use map-reduce to generate the *.warc.wet.gz files in the output wet folder.

e. Copy the generated wet files across from /user/vagrant/warctest/wet/:
	cd /vagrant    (or else: cd /home/vagrant)
	hdfs dfs -get /user/vagrant/warctest/wet/crawl.warc.wet.gz .

   or, when dealing with multiple input warc files, there will be multiple wet files:
	hdfs dfs -get /user/vagrant/warctest/wet/*.warc.wet.gz .

f. Now the contents of the WET files can be viewed to confirm they are what we want:
	gunzip crawl.warc.wet.gz
	zless crawl.warc.wet

The wet file contents should look good now: the web pages as WET records without html tags.
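Steps (b) to (f) above can be collected into one script. This is a hedged sketch, not from the original notes: it only PRINTS each hdfs/hadoop command (via run()) so the sequence can be sanity-checked on a machine without hdfs; delete the echo inside run() to execute for real. The paths and the batch id "batch-id-xyz" are the illustrative values used above, and the HADOOP_MAPRED_HOME default is an assumption to adjust for your install.

```shell
#!/bin/bash
# Hedged sketch of the warc-to-wet steps (b)-(f) above, in dry-run form.
HADOOP_MAPRED_HOME="${HADOOP_MAPRED_HOME:-/usr/local/hadoop}"  # assumed default; adjust
BASE=/user/vagrant/warctest          # hdfs working area from the notes
WARC=crawl.warc.gz                   # input WARC produced by Autistici's crawl
WET="${WARC%.gz}.wet.gz"             # WEATGenerator output name: crawl.warc.wet.gz

run() { echo "$@"; }                 # dry-run: print the command; remove echo to execute

# (b) folder structure needed for warc-to-wet conversion
for d in warc wet wat; do
  run hdfs dfs -mkdir "$BASE/$d"
done

# (c) put the input WARC into the warc folder on hdfs
run hdfs dfs -put "$WARC" "$BASE/warc/."

# (d) run the conversion from inside ia-hadoop-tools
run "$HADOOP_MAPRED_HOME/bin/hadoop" jar \
    target/ia-hadoop-tools-jar-with-dependencies.jar \
    WEATGenerator -strictMode -skipExisting batch-id-xyz "$BASE/warc/$WARC"

# (e) copy the generated wet file back out of hdfs
run hdfs dfs -get "$BASE/wet/$WET" .

# (f) inspect the result
run gunzip "$WET"
run zless "${WET%.gz}"
```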
-----------------------EOF------------------------