Changeset 33409
- Timestamp:
- 2019-08-13T15:59:29+12:00
- Location:
- gs3-extensions/maori-lang-detection/MoreReading
- Files:
- 2 added
- 2 edited
gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt
r33393  r33409
        1
        2       WET FILES:
        3
        4       https://stackoverflow.com/questions/16649535/access-a-common-crawl-aws-public-dataset/25297965#25297965
        5
        6
        7       http://commoncrawl.org/2019/07/june-2019-crawl-archive-now-available/
        8       File List #Files Total Size Compressed (TiB)
        9       WET files CC-MAIN-2019-26/wet.paths.gz 56000 7.59
        10
        11
        12      http://commoncrawl.org/2015/04/announcing-the-common-crawl-index/
        13      (Instructions)
        14
        15      https://gist.github.com/svemir/4207353
        16      (Hadoop related) A Common Crawl Experiment
        17
        18      https://gist.github.com/Smerity/afe7430fdb4371015466
        19
        20      Extract just the text from Common Crawl WARC WET files
        21
        22      https://stackoverflow.com/tags/common-crawl/hot?filter=all
        23
        24      https://stackoverflow.com/questions/45920527/get-offset-and-length-of-a-subset-of-a-wat-archive-from-common-crawl-index-serve/46152773#46152773
        25
        26
        27      "The Common Crawl index does not contain offsets into WAT and WET files. So, the only way is to search the whole WAT/WET file for the desired record/URL. Eventually, it would be possible to estimate the offset because the record order in WARC and WAT/WET files is the same."
        28
        29      https://dmorgan.info/posts/common-crawl-python/
        30      https://groups.google.com/forum/#!topic/common-crawl/pdI3w09AAbQ
        31
        32      Example:
        33      WARC:
        34      tikauka:[142]/Scratch/anupama/maori-lang-detection>wget https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-30/segments/1563195526237.47/crawldiagnostics/CC-MAIN-20190719115720-20190719141720-00077.warc.gz
        35      WET:
        36      tikauka:[142]/Scratch/anupama/maori-lang-detection>wget https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-30/segments/1563195526237.47/wet/CC-MAIN-20190719115720-20190719141720-00508.warc.wet.gz
        37      tikauka:[142]/Scratch/anupama/maori-lang-detection>gunzip CC-MAIN-20190719115720-20190719141720-00508.warc.wet.gz
        38
        39
        40      --------------------------------------------
1       41      http://webdatacommons.org/
2       42
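The answer quoted in these notes says the index holds no offsets into WET files, so a record can only be found by scanning the file. Below is a minimal Python sketch of such a scan. It assumes the usual WARC record layout (a `WARC/...` version line, `Name: value` headers including `WARC-Target-URI` and `Content-Length`, a blank line, then the payload). The function names are illustrative and do not come from any of the linked pages.

```python
import gzip


def iter_wet_records(stream):
    """Yield (headers, text) for each record in a binary WET stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.startswith(b"WARC/"):
            continue  # blank separator lines between records
        headers = {}
        while True:
            line = stream.readline()
            if not line or not line.strip():
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        # Content-Length is the payload size in bytes.
        length = int(headers.get("Content-Length", 0))
        body = stream.read(length)
        yield headers, body.decode("utf-8", "replace")


def find_record(stream, url):
    """Return the extracted text of the first record for url, or None."""
    for headers, text in iter_wet_records(stream):
        if headers.get("WARC-Target-URI") == url:
            return text
    return None


def find_in_wet_file(path, url):
    """Scan a WET file on disk; handles both .gz and gunzipped files."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as stream:
        return find_record(stream, url)
```

For example, `find_in_wet_file("CC-MAIN-20190719115720-20190719141720-00508.warc.wet", some_url)` would scan the gunzipped file from the example above, where `some_url` is whatever page URL is being looked up. The scan is linear, which matches the quoted limitation: without offsets there is no way to seek straight to a record.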
gs3-extensions/maori-lang-detection/MoreReading/other.txt
r33408  r33409
5       5
6       6
7               http://www.basicsbehind.com/extract-text-webpage/
8               http://www.l3s.de/~kohlschuetter/publications/wsdm187-kohlschuetter.pdf
9               https://stackoverflow.com/questions/16649535/access-a-common-crawl-aws-public-dataset/25297965#25297965
10
11
12              http://commoncrawl.org/2019/07/june-2019-crawl-archive-now-available/
13              File List #Files Total Size Compressed (TiB)
14              WET files CC-MAIN-2019-26/wet.paths.gz 56000 7.59
15
16
17              http://commoncrawl.org/2015/04/announcing-the-common-crawl-index/
18              (Instructions)
19
20              https://gist.github.com/svemir/4207353
21              (Hadoop related) A Common Crawl Experiment
22
23              https://gist.github.com/Smerity/afe7430fdb4371015466
24
25              Extract just the text from Common Crawl WARC WET files
26
27              https://stackoverflow.com/tags/common-crawl/hot?filter=all
28
29              https://stackoverflow.com/questions/45920527/get-offset-and-length-of-a-subset-of-a-wat-archive-from-common-crawl-index-serve/46152773#46152773
30
31
32              "The Common Crawl index does not contain offsets into WAT and WET files. So, the only way is to search the whole WAT/WET file for the desired record/URL. Eventually, it would be possible to estimate the offset because the record order in WARC and WAT/WET files is the same."
33
34              https://dmorgan.info/posts/common-crawl-python/
35              https://groups.google.com/forum/#!topic/common-crawl/pdI3w09AAbQ
36
37              Example:
38              WARC:
39              tikauka:[142]/Scratch/anupama/maori-lang-detection>wget https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-30/segments/1563195526237.47/crawldiagnostics/CC-MAIN-20190719115720-20190719141720-00077.warc.gz
40              WET:
41              tikauka:[142]/Scratch/anupama/maori-lang-detection>wget https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-30/segments/1563195526237.47/wet/CC-MAIN-20190719115720-20190719141720-00508.warc.wet.gz
42              tikauka:[142]/Scratch/anupama/maori-lang-detection>gunzip CC-MAIN-20190719115720-20190719141720-00508.warc.wet.gz
43
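The notes list a `wet.paths.gz` file per crawl and then fetch one WET file by hand with wget. Below is a rough Python sketch of how the two could be connected. It assumes each line of `wet.paths.gz` is a path relative to the bucket root (starting with `crawl-data/`), and it reuses the `https://commoncrawl.s3.amazonaws.com/` prefix seen in the wget example. The function name `wet_urls` is hypothetical.

```python
import gzip

# Prefix taken from the wget example in the notes; treat it as an assumption
# about where the crawl data is served from.
BASE_URL = "https://commoncrawl.s3.amazonaws.com/"


def wet_urls(paths_file):
    """Yield a download URL for each path listed in a wet.paths.gz file."""
    with gzip.open(paths_file, "rt", encoding="utf-8") as listing:
        for line in listing:
            path = line.strip()
            if path:  # skip empty lines
                yield BASE_URL + path
```

Each yielded URL can then be passed to wget in the same way as the WET download shown in the notes.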