Changeset 33419

2019-08-15T16:20:03+12:00 (4 years ago)

Last evening I found some links on how language detection is done and how language info is stored in Common Crawl data. Committing these links now rather than later, so that I have immediate access to them from other computers.

1 edited


  • gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt

    r33414 r33419  
     4    The JSON for the index files (that we downloaded for .nz) already contained a "languages" field. The above page mentions that this field lists the primary detected languages of the document, up to 3.
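A minimal sketch of how the "languages" field described above could be used to pick out records in a given language. It assumes each index record is available as one JSON object per line with "url" and "languages" keys; the sample records, the example.nz URLs and the helper names are hypothetical, and "mri" is the ISO 639-3 code for Māori.

```python
import json


def detected_languages(record):
    """Return the language codes from a record's comma-separated "languages" field."""
    value = record.get("languages", "")
    return [code for code in value.split(",") if code]


def records_with_language(json_lines, code):
    """Yield the parsed records whose detected languages include the given code."""
    for line in json_lines:
        record = json.loads(line)
        if code in detected_languages(record):
            yield record


# Hypothetical sample records; real index lines carry many more fields.
sample_lines = [
    '{"url": "https://example.nz/a", "languages": "mri,eng"}',
    '{"url": "https://example.nz/b", "languages": "eng"}',
    '{"url": "https://example.nz/c"}',
]

matches = list(records_with_language(sample_lines, "mri"))
print([record["url"] for record in matches])
```

Records without a "languages" field are treated as having no detected language, so they are skipped rather than raising an error.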
     6    "Language Annotations
     8    We now run the Compact Language Detector 2 (CLD2) on HTML pages to identify the language of a document. CLD2 is able to identify 160 different languages and up to 3 languages per document. The detected languages resp. the ISO-639-3 code are shown in the URL index as a new field, e.g., "languages": "zho,eng". The WARC metadata records contain the full CLD2 response including scores and text coverage:
     10    languages-cld2: {"reliable":true,"text-bytes":3783,"languages":[{"code":"zh","code-iso-639-3":"zho","text-covered":0.93,"score":1943.0,"name":"Chinese"},{"code":"en","code-iso-639-3":"eng","text-covered":0.05,"score":523.0,"name":"ENGLISH"}]}
     12    On github you’ll find the Java bindings to the CLD2 native library and the distribution of the primary document languages as part of our crawl statistics.
     14    Please note that the columnar index does not contain the detected languages for now."
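The languages-cld2 line quoted above can be read with a few lines of Python. This is only a sketch: it assumes the line has already been extracted from a WARC metadata record as a string, and the function name parse_cld2_line is hypothetical. The sample value is the one from the quote.

```python
import json


def parse_cld2_line(line):
    """Split a 'languages-cld2: {...}' line and parse its JSON payload."""
    key, _, payload = line.partition(": ")
    if key != "languages-cld2":
        raise ValueError("not a languages-cld2 line: %r" % key)
    return json.loads(payload)


example = (
    'languages-cld2: {"reliable":true,"text-bytes":3783,"languages":['
    '{"code":"zh","code-iso-639-3":"zho","text-covered":0.93,'
    '"score":1943.0,"name":"Chinese"},'
    '{"code":"en","code-iso-639-3":"eng","text-covered":0.05,'
    '"score":523.0,"name":"ENGLISH"}]}'
)

result = parse_cld2_line(example)

# Pair each ISO 639-3 code with the fraction of text it covers.
summary = [(lang["code-iso-639-3"], lang["text-covered"]) for lang in result["languages"]]

# Treat the language covering the most text as the primary one (an assumption).
primary = max(result["languages"], key=lambda lang: lang["text-covered"])

print(result["reliable"], summary, primary["name"])
```

Choosing the primary language by "text-covered" is one possible reading of the response; the "score" values could be used instead.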
     18    "the columnar index contains the content language of a web page as a new field. Please read the instructions below how to upgrade your tools to read newly added fields."
     24    (2016, as per
     27    "DKPro C4CorpusTools is a collection of tools for processing CommonCrawl corpus, including Creative Commons license detection, boilerplate removal, language detection, and near-duplicate removal."
     30    - Including C4CorpusTools in your Java projects
     31    - Working with C4Corpus - Word count example
    137    There's already python code for getting text: