Timestamp:
2019-10-01T22:27:03+13:00 (5 years ago)
Author:
ak19
Message:
  1. hdfs-cc-work/GS_README.txt now contains the complete instructions to use Autistici crawl to download a website (as WARC file) as well as now also the instructions to convert those WARCs to WET. 2. Moved the first part out of MoreReading/crawling-Nutch.txt. 3. Adding patched WARC-to-WET files for the gitprojects ia-web-commons and ia-hadoop-tools to successfully do the WARC-to-WET processing on WARC files generated by Austistici crawl. (Worked on Dr Bainbridge's home page site as a test. Not tried any other site yet, as I wanted to get the work flow from crawl to WET working.)
File:
1 added

Note: See TracChangeset for help on using the changeset viewer.