Changeset 33419
Timestamp: 2019-08-15T16:20:03+12:00 (4 years ago)
File: 1 edited
gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt
r33414 r33419

https://commoncrawl.github.io/cc-crawl-statistics/plots/languages
http://commoncrawl.org/2018/08/august-2018-crawl-archive-now-available/

The JSON for the index files (that we downloaded for .nz) already contained a "languages:" field. The announcement page above says this field lists the primary detected languages of the document, up to 3.

"Language Annotations

We now run the Compact Language Detector 2 (CLD2) on HTML pages to identify the language of a document. CLD2 is able to identify 160 different languages and up to 3 languages per document. The detected languages resp. the ISO-639-3 code are shown in the URL index as a new field, e.g., "languages": "zho,eng". The WARC metadata records contain the full CLD2 response including scores and text coverage:

languages-cld2: {"reliable":true,"text-bytes":3783,"languages":[{"code":"zh","code-iso-639-3":"zho","text-covered":0.93,"score":1943.0,"name":"Chinese"},{"code":"en","code-iso-639-3":"eng","text-covered":0.05,"score":523.0,"name":"ENGLISH"}]}

On github you'll find the Java bindings to the CLD2 native library and the distribution of the primary document languages as part of our crawl statistics.

Please note that the columnar index does not contain the detected languages for now."


http://commoncrawl.org/2018/10/september-2018-crawl-archive-now-available/
"the columnar index contains the content language of a web page as a new field. Please read the instructions below how to upgrade your tools to read newly added fields."
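
A minimal Python sketch of how the two annotations quoted above could be read. It is not taken from Common Crawl's own tooling. The helper names (parse_cld2_line, index_languages, has_language) are hypothetical. Using "mri" (the ISO-639-3 code for Māori) as the default filter is an assumption based on the purpose of this directory, and the sample line is the one quoted in the announcement.

```python
import json

# Sample WARC metadata line, copied from the announcement quoted above.
SAMPLE = (
    'languages-cld2: {"reliable":true,"text-bytes":3783,"languages":['
    '{"code":"zh","code-iso-639-3":"zho","text-covered":0.93,'
    '"score":1943.0,"name":"Chinese"},'
    '{"code":"en","code-iso-639-3":"eng","text-covered":0.05,'
    '"score":523.0,"name":"ENGLISH"}]}'
)


def parse_cld2_line(line):
    """Decode the JSON payload of a 'languages-cld2: {...}' metadata line."""
    # Split at the first colon only; the payload itself contains colons.
    key, _, payload = line.partition(":")
    if key.strip() != "languages-cld2":
        raise ValueError("not a languages-cld2 line")
    return json.loads(payload)


def index_languages(record):
    """Return the ISO-639-3 codes from a URL index record (a dict).

    The "languages" field is a comma-separated string such as "zho,eng".
    Records without the field yield an empty list.
    """
    value = record.get("languages", "")
    return [code for code in value.split(",") if code]


def has_language(record, code="mri"):
    """True if the index record lists the given ISO-639-3 code.

    The default "mri" (Māori) is an assumption made for this sketch.
    """
    return code in index_languages(record)


if __name__ == "__main__":
    response = parse_cld2_line(SAMPLE)
    for lang in response["languages"]:
        print(lang["code-iso-639-3"], lang["text-covered"], lang["score"])
```

Filtering on the index field first means only the records that already list the wanted language need their WARC metadata fetched, and the per-language "text-covered" value can then be used to decide how much of each page is in that language.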

http://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/

---

https://www.aclweb.org/anthology/L16-1443 (2016, as per https://pbn.nauka.gov.pl/sedno-webapp/getReport/38108)

https://dkpro.github.io/dkpro-c4corpus/
"DKPro C4CorpusTools is a collection of tools for processing CommonCrawl corpus, including Creative Commons license detection, boilerplate removal, language detection, and near-duplicate removal."

https://zoidberg.ukp.informatik.tu-darmstadt.de/jenkins/job/DKPro%20C4Corpus/org.dkpro.c4corpus$dkpro-c4corpus-doc/doclinks/1/#_including_c4corpustools_in_your_java_projects
- Including C4CorpusTools in your Java projects
- Working with C4Corpus - Word count example

https://github.com/farhansiddiqui/webscale_nlp

https://github.com/commoncrawl/language-detection-cld2
---------
There's already python code for getting text: