Changeset 33419

Timestamp:

2019-08-15T16:20:03+12:00 (5 years ago)

Author:

ak19

Message:

Yesterday evening I found some links on how language detection is done and how language info is stored in Common Crawl data. Committing these links now rather than later, so that I have immediate access to them from other computers.

File:

1 edited

gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt (modified) (1 diff)

gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt

-              r33414
+              r33419
+https://commoncrawl.github.io/cc-crawl-statistics/plots/languages
+http://commoncrawl.org/2018/08/august-2018-crawl-archive-now-available/
+    The JSON for the index files (that we downloaded for .nz) already contained a "languages" field. The above page mentions that this field lists the primary detected languages of the document, up to 3.
+"Language Annotations
+We now run the Compact Language Detector 2 (CLD2) on HTML pages to identify the language of a document. CLD2 is able to identify 160 different languages and up to 3 languages per document. The detected languages resp. the ISO-639-3 code are shown in the URL index as a new field, e.g., "languages": "zho,eng". The WARC metadata records contain the full CLD2 response including scores and text coverage:
+languages-cld2: {"reliable":true,"text-bytes":3783,"languages":[{"code":"zh","code-iso-639-3":"zho","text-covered":0.93,"score":1943.0,"name":"Chinese"},{"code":"en","code-iso-639-3":"eng","text-covered":0.05,"score":523.0,"name":"ENGLISH"}]}
+On github you'll find the Java bindings to the CLD2 native library and the distribution of the primary document languages as part of our crawl statistics.
+Please note that the columnar index does not contain the detected languages for now. "
+http://commoncrawl.org/2018/10/september-2018-crawl-archive-now-available/
+"the columnar index contains the content language of a web page as a new field. Please read the instructions below how to upgrade your tools to read newly added fields."
+http://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/
+---
+https://www.aclweb.org/anthology/L16-1443 (2016, as per https://pbn.nauka.gov.pl/sedno-webapp/getReport/38108)
+https://dkpro.github.io/dkpro-c4corpus/
+"DKPro C4CorpusTools is a collection of tools for processing CommonCrawl corpus, including Creative Commons license detection, boilerplate removal, language detection, and near-duplicate removal."
+https://zoidberg.ukp.informatik.tu-darmstadt.de/jenkins/job/DKPro%20C4Corpus/org.dkpro.c4corpus$dkpro-c4corpus-doc/doclinks/1/#_including_c4corpustools_in_your_java_projects
+- Including C4CorpusTools in your Java projects
+- Working with C4Corpus - Word count example
+https://github.com/farhansiddiqui/webscale_nlp
+https://github.com/commoncrawl/language-detection-cld2
+---------
 There's already python code for getting text:
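
Not part of the changeset: a minimal Python sketch of how the "languages-cld2" line and the URL-index "languages" field quoted in the diff above might be read. The function names are hypothetical, the sample line is copied from the quoted announcement, and filtering on "mri" (the ISO 639-3 code for Maori) is an assumption about what this project would look for.

```python
import json

# Sample WARC metadata line, copied from the Common Crawl announcement quoted above.
SAMPLE = (
    'languages-cld2: {"reliable":true,"text-bytes":3783,"languages":['
    '{"code":"zh","code-iso-639-3":"zho","text-covered":0.93,"score":1943.0,"name":"Chinese"},'
    '{"code":"en","code-iso-639-3":"eng","text-covered":0.05,"score":523.0,"name":"ENGLISH"}]}'
)


def parse_cld2_line(line):
    """Split a 'languages-cld2: {...}' line at the first colon and decode the JSON payload."""
    key, _, payload = line.partition(":")
    if key.strip() != "languages-cld2":
        raise ValueError("not a languages-cld2 line: %r" % key)
    return json.loads(payload)


def detected_codes(record, min_coverage=0.0):
    """Return the ISO 639-3 codes whose 'text-covered' share is at least min_coverage."""
    return [
        lang["code-iso-639-3"]
        for lang in record["languages"]
        if lang["text-covered"] >= min_coverage
    ]


def index_entry_has_language(entry, code):
    """Check the comma-separated 'languages' field of a URL-index entry (a dict) for code."""
    return code in entry.get("languages", "").split(",")


if __name__ == "__main__":
    record = parse_cld2_line(SAMPLE)
    print(detected_codes(record))
    # Hypothetical index entry using the example value from the announcement.
    print(index_entry_has_language({"languages": "zho,eng"}, "mri"))
```

Because the index field holds at most three codes per page, a page with only a little Maori text may not list "mri" at all; the full "languages-cld2" record with its coverage values is the place to check such cases.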
