Changeset 33404

Timestamp: 12.08.2019 20:35:48
Author: ak19
Message:
1. Links to other Java ways of extracting text from web content.
2. Getting at commoncrawl WET files

Files:
1 modified

  • gs3-extensions/maori-lang-detection/MoreReading/other.txt

    r33376  r33404
    3       3
    4       4       https://www.quora.com/What-are-some-Web-crawler-tips-to-avoid-crawler-traps
            5
            6
            7       http://www.basicsbehind.com/extract-text-webpage/
            8           http://www.l3s.de/~kohlschuetter/publications/wsdm187-kohlschuetter.pdf
            9       https://stackoverflow.com/questions/16649535/access-a-common-crawl-aws-public-dataset/25297965#25297965
            10
            11
            12      http://commoncrawl.org/2019/07/june-2019-crawl-archive-now-available/
            13                      File List                       #Files  Total Size Compressed (TiB)
            14          WET files   CC-MAIN-2019-26/wet.paths.gz    56000   7.59
            15
            16
            17      http://commoncrawl.org/2015/04/announcing-the-common-crawl-index/
            18      (Instructions)
            19
            20      https://gist.github.com/svemir/4207353
            21
            22      https://gist.github.com/Smerity/afe7430fdb4371015466
            23
            24       Extract just the text from Common Crawl WARC WET files
            25
            26      https://stackoverflow.com/tags/common-crawl/hot?filter=all
            27
            28      https://stackoverflow.com/questions/45920527/get-offset-and-length-of-a-subset-of-a-wat-archive-from-common-crawl-index-serve/46152773#46152773
            29
            30
            31      "The Common Crawl index does not contain offsets into WAT and WET files. So, the only way is to search the whole WAT/WET file for the desired record/URL. Eventually, it would be possible to estimate the offset because the record order in WARC and WAT/WET files is the same."
            32
            33      https://dmorgan.info/posts/common-crawl-python/
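
The Stack Overflow answer quoted above says the index holds no offsets into WET files, so getting the text for one URL means scanning the whole file. Below is a minimal Python sketch of such a linear scan. It assumes the WET file is a gzip-compressed WARC file whose records carry WARC-Target-URI and Content-Length headers. The names iter_wet_records and find_record are hypothetical and do not come from any of the linked pages.

```python
import gzip


def iter_wet_records(stream):
    """Yield (headers, text) for each record in an uncompressed WARC/WET byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.startswith(b"WARC/"):
            continue  # skip the blank separator lines between records
        headers = {}
        while True:
            line = stream.readline()
            if not line or line in (b"\r\n", b"\n"):
                break  # an empty line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        # Read the payload by its declared length, so payload text that
        # happens to look like a record header is never misparsed.
        length = int(headers.get("Content-Length", 0))
        body = stream.read(length)
        yield headers, body.decode("utf-8", "replace")


def find_record(wet_gz, target_url):
    """Linear scan of a gzipped WET file (path or binary file object).

    Returns the extracted text of the first record whose WARC-Target-URI
    equals target_url, or None if no record matches.
    """
    with gzip.open(wet_gz, "rb") as stream:
        for headers, text in iter_wet_records(stream):
            if headers.get("WARC-Target-URI") == target_url:
                return text
    return None
```

A call would look like find_record("some-segment.warc.wet.gz", "http://example.com/"), where the file name is a placeholder for a WET file already downloaded from one of the paths listed in wet.paths.gz. Every record before the match has to be decompressed and read, which is the cost the quoted answer describes.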