- Timestamp: 2019-08-12T20:35:48+12:00
- File: 1 edited
gs3-extensions/maori-lang-detection/MoreReading/other.txt
r33376 r33404

https://www.quora.com/What-are-some-Web-crawler-tips-to-avoid-crawler-traps

http://www.basicsbehind.com/extract-text-webpage/
http://www.l3s.de/~kohlschuetter/publications/wsdm187-kohlschuetter.pdf
https://stackoverflow.com/questions/16649535/access-a-common-crawl-aws-public-dataset/25297965#25297965

http://commoncrawl.org/2019/07/june-2019-crawl-archive-now-available/
            File List                      #Files   Total Size Compressed (TiB)
WET files   CC-MAIN-2019-26/wet.paths.gz   56000    7.59

http://commoncrawl.org/2015/04/announcing-the-common-crawl-index/
(Instructions)

https://gist.github.com/svemir/4207353

https://gist.github.com/Smerity/afe7430fdb4371015466

Extract just the text from Common Crawl WARC WET files

https://stackoverflow.com/tags/common-crawl/hot?filter=all

https://stackoverflow.com/questions/45920527/get-offset-and-length-of-a-subset-of-a-wat-archive-from-common-crawl-index-serve/46152773#46152773

"The Common Crawl index does not contain offsets into WAT and WET files. So, the only way is to search the whole WAT/WET file for the desired record/URL. Eventually, it would be possible to estimate the offset because the record order in WARC and WAT/WET files is the same."

https://dmorgan.info/posts/common-crawl-python/
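
The quoted answer says the index gives no offsets into WET files, so finding the text for one URL means scanning the file record by record. A minimal Python sketch of that linear scan follows. It is not taken from any of the linked pages. It assumes the WET file is a gzip-compressed WARC-format file in which each record carries WARC-Target-URI and Content-Length headers. The names iter_warc_records, find_record and find_in_wet_file are hypothetical, and only the standard library is used.

```python
import gzip


def iter_warc_records(stream):
    """Yield (headers, body) for each record in an uncompressed WARC/WET byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.startswith(b"WARC/"):
            continue  # skip the blank separator lines between records
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        # Assumption: every record declares its body size in Content-Length.
        length = int(headers.get("Content-Length", 0))
        yield headers, stream.read(length)


def find_record(stream, target_uri):
    """Linear scan: return the text of the first record for target_uri, else None."""
    for headers, body in iter_warc_records(stream):
        if headers.get("WARC-Target-URI") == target_uri:
            return body.decode("utf-8", "replace")
    return None


def find_in_wet_file(path, target_uri):
    """Open a local .warc.wet.gz file (path is hypothetical) and scan it."""
    with gzip.open(path, "rb") as stream:
        return find_record(stream, target_uri)
```

The scan costs one pass over the whole file per lookup. Since the quote notes that record order matches between WARC and WAT/WET files, an offset estimate could in principle shorten the search, but this sketch does not attempt that.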