Changeset 33419

Timestamp:
15.08.2019 16:20:03 (7 days ago)
Author:
ak19
Message:

Last evening I found some links about how language detection is done and how language info is stored in Common Crawl data. Committing these links now instead of later, so I have immediate access to them from other computers.

Files:
1 modified

  • gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt

    r33414 r33419  
     1https://commoncrawl.github.io/cc-crawl-statistics/plots/languages 
     2http://commoncrawl.org/2018/08/august-2018-crawl-archive-now-available/ 
     3 
     4    The JSON for the index files (that we downloaded for .nz) already contained a "languages" field. The above page mentions that this shows the primary detected languages of the document, up to 3 of them. 
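
A minimal sketch of how the "languages" field could be used to pick out index records for one language. It assumes each index line has the layout "<urlkey> <timestamp> <JSON>" and that the field holds comma-separated ISO-639-3 codes, as in the "zho,eng" example quoted below. The function name has_language and the sample lines are made up for illustration; "mri" is the ISO-639-3 code for Maori.

```python
import json


def has_language(index_line, iso639_3_code):
    """Return True if the record's "languages" field lists the given code."""
    # Assumed layout: "<urlkey> <timestamp> <JSON>"; the JSON starts at the first "{".
    json_start = index_line.find("{")
    if json_start == -1:
        return False
    record = json.loads(index_line[json_start:])
    # The field is a comma-separated string such as "zho,eng"; it may be absent.
    languages = record.get("languages", "")
    return iso639_3_code in languages.split(",")


# Made-up sample lines, for illustration only.
sample_lines = [
    'nz,example)/ 20190815000000 {"url": "http://example.nz/", "languages": "mri,eng"}',
    'nz,example)/about 20190815000001 {"url": "http://example.nz/about", "languages": "eng"}',
    'nz,example)/img 20190815000002 {"url": "http://example.nz/img.png"}',
]

maori_lines = [line for line in sample_lines if has_language(line, "mri")]
for line in maori_lines:
    print(line)
```

Records without a "languages" field (for example non-HTML captures) are simply skipped by this check rather than treated as errors.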
     5 
     6"Language Annotations 
     7 
     8We now run the Compact Language Detector 2 (CLD2) on HTML pages to identify the language of a document. CLD2 is able to identify 160 different languages and up to 3 languages per document. The detected languages resp. the ISO-639-3 code are shown in the URL index as a new field, e.g., "languages": "zho,eng". The WARC metadata records contain the full CLD2 response including scores and text coverage: 
     9 
     10languages-cld2: {"reliable":true,"text-bytes":3783,"languages":[{"code":"zh","code-iso-639-3":"zho","text-covered":0.93,"score":1943.0,"name":"Chinese"},{"code":"en","code-iso-639-3":"eng","text-covered":0.05,"score":523.0,"name":"ENGLISH"}]} 
     11 
     12On github you’ll find the Java bindings to the CLD2 native library and the distribution of the primary document languages as part of our crawl statistics. 
     13 
     14Please note that the columnar index does not contain the detected languages for now. " 
     15 
     16     
     17http://commoncrawl.org/2018/10/september-2018-crawl-archive-now-available/ 
     18"the columnar index contains the content language of a web page as a new field. Please read the instructions below how to upgrade your tools to read newly added fields." 
     19 
     20http://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/ 
     21 
     22--- 
     23 
     24https://www.aclweb.org/anthology/L16-1443 (2016, as per https://pbn.nauka.gov.pl/sedno-webapp/getReport/38108) 
     25 
     26https://dkpro.github.io/dkpro-c4corpus/ 
     27"DKPro C4CorpusTools is a collection of tools for processing CommonCrawl corpus, including Creative Commons license detection, boilerplate removal, language detection, and near-duplicate removal." 
     28 
     29https://zoidberg.ukp.informatik.tu-darmstadt.de/jenkins/job/DKPro%20C4Corpus/org.dkpro.c4corpus$dkpro-c4corpus-doc/doclinks/1/#_including_c4corpustools_in_your_java_projects 
     30- Including C4CorpusTools in your Java projects 
     31- Working with C4Corpus - Word count example 
     32 
     33https://github.com/farhansiddiqui/webscale_nlp 
     34 
     35https://github.com/commoncrawl/language-detection-cld2 
     36--------- 
    137There's already python code for getting text: 
    238