Changeset 33414

Timestamp:
13.08.2019 21:57:58
Author:
ak19
Message:

Adding important links

Files:
1 modified

  • gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt

    r33409 r33414  
        1  There's already python code for getting text:
        2
        3  https://spark-in.me/post/parsing-common-crawl-in-two-simple-commands
        4  https://gist.github.com/Smerity/afe7430fdb4371015466
        5
        6  https://spark-in.me/post/parsing-common-crawl-in-two-simple-commands
        7
        8  "But it turns out - it is not. This can be attributed to the effort that has been made to make the CC more accessible. The killer feature for me was the presence of their index weighting only ~200Gb, that also features a language detection option, i.e. you do not need to analyze top-level-domains or do any significant data mining."
        9
       10  What does the "language detection option" discussion above mean?
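
One possible reading of the question above, offered as an assumption rather than something the quoted post confirms: the Common Crawl columnar index carries a content_languages column of comma-separated ISO 639-3 codes, so candidate pages can be picked from index rows alone. The sketch below filters already-fetched index rows for Maori (code mri). The rows are plain dicts, and both the sample rows and the helper name rows_in_language are hypothetical.

```python
def rows_in_language(rows, lang_code):
    """Return the index rows whose detected languages include lang_code."""
    matches = []
    for row in rows:
        # Assumption: content_languages is a comma-separated string such as
        # "eng,mri"; rows with no detected language may hold None or "".
        detected = (row.get("content_languages") or "").split(",")
        if lang_code in (code.strip() for code in detected):
            matches.append(row)
    return matches


# Hypothetical sample rows, not real index data.
sample_rows = [
    {"url": "http://example.org/a", "content_languages": "eng"},
    {"url": "http://example.org/b", "content_languages": "mri,eng"},
    {"url": "http://example.org/c", "content_languages": None},
]
maori_rows = rows_in_language(sample_rows, "mri")
```

If this reading holds, the top-level domain of a page would not need to be inspected at all; the language column alone narrows the crawl to candidate pages.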
       11
       12  ------------
       13  Skipping CrawlDiagnostics (see below) and robots.txt gz files:
       14
       15  http://commoncrawl.org/2018/08/august-2018-crawl-archive-now-available/
       16
       17  "HTTP 304 notmodified" responses are now stored as WARC revisit records in the "crawldiagnostics" subset along with 404s, redirects and other non-200 responses. For now the revisit records contain a payload digest although there is no payload sent together with HTTP 304 responses. The stupid reason is that the columnar index requires the digest field and we want to make sure that all tools continue to work as expected. The SHA-1 digest of an empty payload (zero bytes) is used for the revisit records.
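
The empty-payload digest mentioned in the quote can be computed directly, which gives a cheap way to recognise such records. This is a minimal sketch: the helper name has_empty_payload is hypothetical, and the "sha1:" prefix with a Base32 value is an assumption about how the digest is written in a given record, so the hex form is accepted as well.

```python
import base64
import hashlib

# SHA-1 of zero bytes, in hex and in Base32.
_empty = hashlib.sha1(b"")
EMPTY_PAYLOAD_SHA1_HEX = _empty.hexdigest()
EMPTY_PAYLOAD_SHA1_B32 = base64.b32encode(_empty.digest()).decode("ascii")


def has_empty_payload(digest):
    """True if digest (optionally prefixed with "sha1:") is the SHA-1 of zero bytes."""
    value = digest.split(":", 1)[-1]
    return (value.lower() == EMPTY_PAYLOAD_SHA1_HEX
            or value.upper() == EMPTY_PAYLOAD_SHA1_B32)
```

A record whose payload digest matches either constant carries no body, so there is no text to extract from it.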
       18
       19  http://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#revisit
       20  ‘revisit’
       21  General
       22
       23  A ‘revisit’ record describes the revisitation of content already archived, and might include only an abbreviated content body which has to be interpreted relative to a previous record. Most typically, a ‘revisit’ record is used instead of a ‘response’ or ‘resource’ record to indicate that the content visited was either a complete or substantial duplicate of material previously archived.
       24  ...
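
The skipping described at the top of this section could be sketched as a filter over archive file paths. This assumes the subset name ("crawldiagnostics", "robotstxt") appears as its own path component; the sample paths and the helper name keep_path are hypothetical and should be checked against the real path listings of a crawl.

```python
# Subsets to skip: non-200 responses (including revisit records) and robots.txt captures.
SKIPPED_SUBSETS = {"crawldiagnostics", "robotstxt"}


def keep_path(path):
    """True if no component of the path names a skipped subset."""
    return not SKIPPED_SUBSETS.intersection(path.split("/"))


# Hypothetical paths, for illustration only.
sample_paths = [
    "crawl-data/CC-MAIN-2018-34/segments/SEGMENT/warc/file-00000.warc.gz",
    "crawl-data/CC-MAIN-2018-34/segments/SEGMENT/crawldiagnostics/file-00000.warc.gz",
    "crawl-data/CC-MAIN-2018-34/segments/SEGMENT/robotstxt/file-00000.warc.gz",
]
kept_paths = [p for p in sample_paths if keep_path(p)]
```

Matching on whole path components rather than substrings avoids dropping a file that merely has one of these words inside its name.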
       25
       26  -------
    1  27
    2  28  WET FILES: