Changeset 33409

Timestamp:
13.08.2019 15:59:29
Author:
ak19
Message:

Forgot to commit 2 files with links. Also shuffled some links into the correct files after moving between computers.

Location:
gs3-extensions/maori-lang-detection/MoreReading
Files:
2 added
2 modified

  • gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt

    r33393 r33409  
+
+WET FILES:
+
+https://stackoverflow.com/questions/16649535/access-a-common-crawl-aws-public-dataset/25297965#25297965
+
+
+http://commoncrawl.org/2019/07/june-2019-crawl-archive-now-available/
+            File List   #Files  Total Size Compressed (TiB)
+    WET files   CC-MAIN-2019-26/wet.paths.gz    56000   7.59
+
+
+http://commoncrawl.org/2015/04/announcing-the-common-crawl-index/
+(Instructions)
+
+https://gist.github.com/svemir/4207353
+(Hadoop related) A Common Crawl Experiment
+
+https://gist.github.com/Smerity/afe7430fdb4371015466
+
+ Extract just the text from Common Crawl WARC WET files
+
+https://stackoverflow.com/tags/common-crawl/hot?filter=all
+
+https://stackoverflow.com/questions/45920527/get-offset-and-length-of-a-subset-of-a-wat-archive-from-common-crawl-index-serve/46152773#46152773
+
+
+"The Common Crawl index does not contain offsets into WAT and WET files. So, the only way is to search the whole WAT/WET file for the desired record/URL. Eventually, it would be possible to estimate the offset because the record order in WARC and WAT/WET files is the same."
+
+https://dmorgan.info/posts/common-crawl-python/
+https://groups.google.com/forum/#!topic/common-crawl/pdI3w09AAbQ
+
+Example:
+WARC:
+tikauka:[142]/Scratch/anupama/maori-lang-detection>wget https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-30/segments/1563195526237.47/crawldiagnostics/CC-MAIN-20190719115720-20190719141720-00077.warc.gz
+WET:
+tikauka:[142]/Scratch/anupama/maori-lang-detection>wget https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-30/segments/1563195526237.47/wet/CC-MAIN-20190719115720-20190719141720-00508.warc.wet.gz
+tikauka:[142]/Scratch/anupama/maori-lang-detection>gunzip CC-MAIN-20190719115720-20190719141720-00508.warc.wet.gz
+
+
+--------------------------------------------
 http://webdatacommons.org/
 
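The quoted Stack Overflow answer says the only way to locate a record in a WET file is to search the whole file for the desired URL. A minimal Python sketch of that linear search follows. It assumes each record starts with a "WARC/" version line, carries a Content-Length header and, for page records, a WARC-Target-URI header. The function names iter_wet_records and find_record are hypothetical and do not come from the linked pages.

```python
import gzip


def iter_wet_records(stream):
    """Yield (headers, text) for each record in a binary WET stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.startswith(b"WARC/"):
            continue  # blank separator lines between records
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Read exactly the declared payload so body text is never
        # mistaken for the start of the next record.
        length = int(headers.get("content-length", 0))
        body = stream.read(length)
        yield headers, body.decode("utf-8", "replace")


def find_record(path, target_uri):
    """Linear search: return the text of the first record for target_uri, or None."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as stream:
        for headers, text in iter_wet_records(stream):
            if headers.get("warc-target-uri") == target_uri:
                return text
    return None
```

Because find_record accepts both the .gz download and the gunzipped file, it could be pointed at either form of the WET file fetched in the example. The URL passed in has to match the record's WARC-Target-URI exactly.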
  • gs3-extensions/maori-lang-detection/MoreReading/other.txt

    r33408 r33409  
 
 
-http://www.basicsbehind.com/extract-text-webpage/
-    http://www.l3s.de/~kohlschuetter/publications/wsdm187-kohlschuetter.pdf
-https://stackoverflow.com/questions/16649535/access-a-common-crawl-aws-public-dataset/25297965#25297965
-
-
-http://commoncrawl.org/2019/07/june-2019-crawl-archive-now-available/
-            File List   #Files  Total Size Compressed (TiB)
-    WET files   CC-MAIN-2019-26/wet.paths.gz    56000   7.59
-
-
-http://commoncrawl.org/2015/04/announcing-the-common-crawl-index/
-(Instructions)
-
-https://gist.github.com/svemir/4207353
-(Hadoop related) A Common Crawl Experiment
-
-https://gist.github.com/Smerity/afe7430fdb4371015466
-
- Extract just the text from Common Crawl WARC WET files
-
-https://stackoverflow.com/tags/common-crawl/hot?filter=all
-
-https://stackoverflow.com/questions/45920527/get-offset-and-length-of-a-subset-of-a-wat-archive-from-common-crawl-index-serve/46152773#46152773
-
-
-"The Common Crawl index does not contain offsets into WAT and WET files. So, the only way is to search the whole WAT/WET file for the desired record/URL. Eventually, it would be possible to estimate the offset because the record order in WARC and WAT/WET files is the same."
-
-https://dmorgan.info/posts/common-crawl-python/
-https://groups.google.com/forum/#!topic/common-crawl/pdI3w09AAbQ
-
-Example:
-WARC:
-tikauka:[142]/Scratch/anupama/maori-lang-detection>wget https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-30/segments/1563195526237.47/crawldiagnostics/CC-MAIN-20190719115720-20190719141720-00077.warc.gz
-WET:
-tikauka:[142]/Scratch/anupama/maori-lang-detection>wget https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-30/segments/1563195526237.47/wet/CC-MAIN-20190719115720-20190719141720-00508.warc.wet.gz
-tikauka:[142]/Scratch/anupama/maori-lang-detection>gunzip CC-MAIN-20190719115720-20190719141720-00508.warc.wet.gz
    43
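The notes list a wet.paths.gz file per crawl and then fetch a single WET file by hand with wget. A rough Python sketch of how the two might connect is given here. It assumes that each line of wet.paths.gz is a path relative to the bucket root and that the host prefix from the wget example still applies, which may no longer hold. BASE_URL and wet_urls are hypothetical names introduced for illustration.

```python
import gzip

# Assumption: prefix copied from the wget example in the notes.
BASE_URL = "https://commoncrawl.s3.amazonaws.com/"


def wet_urls(paths_file):
    """Yield one download URL per non-empty line of a wet.paths(.gz) listing."""
    opener = gzip.open if paths_file.endswith(".gz") else open
    with opener(paths_file, "rt", encoding="utf-8") as lines:
        for line in lines:
            path = line.strip()
            if path:
                yield BASE_URL + path
```

Iterating over wet_urls("wet.paths.gz") would then give candidate URLs to hand to wget one at a time, rather than copying each path manually.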