Changeset 33414 for gs3-extensions
- Timestamp: 2019-08-13T21:57:58+12:00
- File: 1 edited
gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt
r33409 r33414
+There's already python code for getting text:
+
+https://spark-in.me/post/parsing-common-crawl-in-two-simple-commands
+https://gist.github.com/Smerity/afe7430fdb4371015466
+
+https://spark-in.me/post/parsing-common-crawl-in-two-simple-commands
+
+"But it turns out - it is not. This can be attributed to the effort that has been made to make the CC more accessible. The killer feature for me was the presence of their index weighting only ~200Gb, that also features a language detection option, i.e. you do not need to analyze top-level-domains or do any significant data mining."
+
+What does the "language detection option" discussion above mean?
+
+------------
+Skipping CrawlDiagnostics (see below) and robots.txt gz files:
+
+http://commoncrawl.org/2018/08/august-2018-crawl-archive-now-available/
+
+"HTTP 304 notmodified" responses are now stored as WARC revisit records in the "crawldiagnostics" subset along with 404s, redirects and other non-200 responses. For now the revisit records contain a payload digest although there is no payload sent together with HTTP 304 responses. The stupid reason is that the columnar index requires the digest field and we want to make sure that all tools continue to work as expected. The SHA-1 digest of an empty payload (zero bytes) is used for the revisit records.
+
+http://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#revisit
+‘revisit’
+General
+
+A ‘revisit’ record describes the revisitation of content already archived, and might include only an abbreviated content body which has to be interpreted relative to a previous record. Most typically, a ‘revisit’ record is used instead of a ‘response’ or ‘resource’ record to indicate that the content visited was either a complete or substantial duplicate of material previously archived.
+...
+
+-------
 
 WET FILES:
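
Editorial sketch, not part of the committed file: the note on skipping CrawlDiagnostics and robots.txt files, and the remark that revisit records carry the SHA-1 digest of an empty payload, can be illustrated in a few lines of Python. The subset names used as path components, the `sha1:` plus Base32 digest spelling, and the helper names `should_skip` and `is_empty_revisit` are assumptions made for illustration. They are not taken from the changeset.

```python
import base64
import hashlib

# Assumption: these directory names identify the subsets to skip in a
# Common Crawl file path. Adjust them to match the actual path listing.
SKIPPED_SUBSETS = ("crawldiagnostics", "robotstxt")


def empty_payload_digest() -> str:
    """SHA-1 of zero bytes, written as 'sha1:' + Base32 (assumed WARC spelling)."""
    raw = hashlib.sha1(b"").digest()
    return "sha1:" + base64.b32encode(raw).decode("ascii")


def should_skip(path: str) -> bool:
    """True if any path component names one of the skipped subsets."""
    parts = path.split("/")
    return any(subset in parts for subset in SKIPPED_SUBSETS)


def is_empty_revisit(record_type: str, payload_digest: str) -> bool:
    """True for a revisit record whose digest is that of an empty payload."""
    return record_type == "revisit" and payload_digest == empty_payload_digest()
```

Filtering by path component avoids opening the skipped files at all. The digest check is a second guard for revisit records that turn up in files that were not filtered out.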