Timestamp: 2019-08-13T15:59:29+12:00
Author: ak19
Message:

Commit 2 files of links that I forgot to commit earlier, and shuffle some links around into the correct files after moving between computers.

File: 1 edited

  • gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt

    r33393 r33409

WET FILES:

https://stackoverflow.com/questions/16649535/access-a-common-crawl-aws-public-dataset/25297965#25297965


http://commoncrawl.org/2019/07/june-2019-crawl-archive-now-available/
            File List                      #Files  Total Size Compressed (TiB)
WET files   CC-MAIN-2019-26/wet.paths.gz   56000   7.59


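Sketch (Python): the wet.paths.gz listing above is a gzipped text file with one relative path per line, so a list of download URLs can be built from it. This assumes the same https://commoncrawl.s3.amazonaws.com/ prefix as the wget example in these notes; `wet_urls` and `BASE_URL` are hypothetical names used only for illustration.

```python
import gzip

# Assumption: same download prefix as the wget example in these notes.
BASE_URL = "https://commoncrawl.s3.amazonaws.com/"


def wet_urls(paths_gz, base_url=BASE_URL):
    """Yield one download URL per non-empty line of a wet.paths.gz file."""
    with gzip.open(paths_gz, "rt", encoding="utf-8") as listing:
        for line in listing:
            path = line.strip()
            if path:
                yield base_url + path
```

Possible usage, once wet.paths.gz has been downloaded: `for url in wet_urls("wet.paths.gz"): print(url)`.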
http://commoncrawl.org/2015/04/announcing-the-common-crawl-index/
(Instructions)

https://gist.github.com/svemir/4207353
(Hadoop related) A Common Crawl Experiment

https://gist.github.com/Smerity/afe7430fdb4371015466

Extract just the text from Common Crawl WARC WET files

https://stackoverflow.com/tags/common-crawl/hot?filter=all

https://stackoverflow.com/questions/45920527/get-offset-and-length-of-a-subset-of-a-wat-archive-from-common-crawl-index-serve/46152773#46152773


"The Common Crawl index does not contain offsets into WAT and WET files. So, the only way is to search the whole WAT/WET file for the desired record/URL. Eventually, it would be possible to estimate the offset because the record order in WARC and WAT/WET files is the same."

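Sketch (Python): since the index has no offsets into WET files, finding one URL means scanning the whole file, as the quote above says. A minimal linear scan using only the standard library, assuming the usual WARC record layout (a `WARC/` version line, header lines, an empty line, then `Content-Length` bytes of body); `find_wet_record` is a hypothetical helper name, not part of any Common Crawl tool.

```python
import gzip


def find_wet_record(wet_gz_path, target_url):
    """Return the plain text of the first WET record for target_url, or None."""
    wanted = target_url.encode("utf-8")
    with gzip.open(wet_gz_path, "rb") as stream:
        while True:
            line = stream.readline()
            if not line:
                return None  # end of file: URL not present
            if not line.startswith(b"WARC/"):
                continue  # blank separator lines between records
            # Read the header block up to the first empty line.
            headers = {}
            for raw in iter(stream.readline, b""):
                raw = raw.strip()
                if not raw:
                    break
                name, _, value = raw.partition(b":")
                headers[name.strip().lower()] = value.strip()
            # Read the body by length so text inside it is never parsed as headers.
            body = stream.read(int(headers.get(b"content-length", b"0")))
            if headers.get(b"warc-target-uri") == wanted:
                return body.decode("utf-8", errors="replace")
```

Each lookup reads the file from the start, so for many URLs it would be cheaper to make one pass and collect all wanted records.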
https://dmorgan.info/posts/common-crawl-python/
https://groups.google.com/forum/#!topic/common-crawl/pdI3w09AAbQ

Example:
WARC:
tikauka:[142]/Scratch/anupama/maori-lang-detection>wget https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-30/segments/1563195526237.47/crawldiagnostics/CC-MAIN-20190719115720-20190719141720-00077.warc.gz
WET:
tikauka:[142]/Scratch/anupama/maori-lang-detection>wget https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-30/segments/1563195526237.47/wet/CC-MAIN-20190719115720-20190719141720-00508.warc.wet.gz
tikauka:[142]/Scratch/anupama/maori-lang-detection>gunzip CC-MAIN-20190719115720-20190719141720-00508.warc.wet.gz


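Sketch (Python): a rough equivalent of the wget + gunzip steps above using only the standard library, for when the download is scripted. The helper names are hypothetical. Unlike the gunzip command, this sketch leaves the .gz file in place.

```python
import gzip
import os
import shutil
import urllib.request
from urllib.parse import urlparse


def local_name(url):
    """File name to save under: the last component of the URL path."""
    return os.path.basename(urlparse(url).path)


def gunzip(gz_path):
    """Decompress gz_path next to itself and return the new path."""
    out_path = gz_path[:-3] if gz_path.endswith(".gz") else gz_path + ".out"
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # streams, so large files fit in memory
    return out_path


def fetch_and_gunzip(url, dest_dir="."):
    """Download url into dest_dir, then decompress it (needs network access)."""
    gz_path = os.path.join(dest_dir, local_name(url))
    urllib.request.urlretrieve(url, gz_path)
    return gunzip(gz_path)
```

Possible usage: call `fetch_and_gunzip(url)` with the WET URL from the example above.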
--------------------------------------------
http://webdatacommons.org/