Changeset 33408

Timestamp: 13.08.2019 15:09:28
Author: ak19
Message: Some rough notes. Will move into appropriate file later.

Files: 1 modified

  • gs3-extensions/maori-lang-detection/MoreReading/other.txt

    --- gs3-extensions/maori-lang-detection/MoreReading/other.txt (r33404)
    +++ gs3-extensions/maori-lang-detection/MoreReading/other.txt (r33408)
    @@ -19,4 +19,5 @@
     
     https://gist.github.com/svemir/4207353
    +(Hadoop related) A Common Crawl Experiment
     
     https://gist.github.com/Smerity/afe7430fdb4371015466
    @@ -32,2 +33,11 @@
     
     https://dmorgan.info/posts/common-crawl-python/
    +https://groups.google.com/forum/#!topic/common-crawl/pdI3w09AAbQ
    +
    +Example:
    +WARC:
    +tikauka:[142]/Scratch/anupama/maori-lang-detection>wget https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-30/segments/1563195526237.47/crawldiagnostics/CC-MAIN-20190719115720-20190719141720-00077.warc.gz
    +WET:
    +tikauka:[142]/Scratch/anupama/maori-lang-detection>wget https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-30/segments/1563195526237.47/wet/CC-MAIN-20190719115720-20190719141720-00508.warc.wet.gz
    +tikauka:[142]/Scratch/anupama/maori-lang-detection>gunzip CC-MAIN-20190719115720-20190719141720-00508.warc.wet.gz
    +
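The notes stop at downloading and unpacking a WET file. As a rough illustration of a possible next step, the sketch below walks the records of such a file using only the Python standard library. It is not part of the changeset: the function names `iter_warc_records` and `iter_wet_texts` are made up for this example, and the parser assumes the usual WARC layout (a `WARC/` version line, header lines, a blank line, then `Content-Length` bytes of payload) and that extracted page text sits in records whose `WARC-Type` is `conversion`.

```python
import gzip
import io


def iter_warc_records(stream):
    """Yield (headers, payload) per record from an uncompressed WARC/WET byte stream.

    Hypothetical helper: a minimal parser, not a full WARC implementation.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        length = int(headers.get("Content-Length", 0))
        yield headers, stream.read(length)


def iter_wet_texts(stream):
    """Yield (url, text) for each 'conversion' record, i.e. extracted page text."""
    for headers, payload in iter_warc_records(stream):
        if headers.get("WARC-Type") == "conversion":
            yield headers.get("WARC-Target-URI"), payload.decode("utf-8", "replace")


# Usage (file name taken from the wget example above):
#   after gunzip:   open("CC-MAIN-20190719115720-20190719141720-00508.warc.wet", "rb")
#   still gzipped:  gzip.open("CC-MAIN-20190719115720-20190719141720-00508.warc.wet.gz", "rb")
# Pass either stream to iter_wet_texts() and loop over the (url, text) pairs.
```

Each yielded text could then be handed to whatever language detector is chosen. For anything beyond a quick look, a dedicated WARC library is likely a safer choice than this hand-rolled parser.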