http://opennlp.apache.org/news/model-langdetect-183.html
Language Detector Model for Apache OpenNLP released

0. Can we make it run?
Can we detect it with Eng, Fr, NL docs?

1. openNLP - is Maori included?
If not, learn how to teach it to recognise Maori,
i.e. how to run the training part of their software and add in a Maori language training set.

ANSWER: Yes, Maori is included, see https://www.apache.org/dist/opennlp/models/langdetect/1.8.3/README.txt

2. Add in a small program to specifically detect Maori.

3. Macron and non-macron input text language recognition support?
-----------------------------------------

READING:
  General:
  * https://stackoverflow.com/questions/7670427/how-does-language-detection-work
  * https://github.com/andreasjansson/language-detection.el
  * https://en.wikipedia.org/wiki/ISO_639-3

  Specific:
  * openNLP download: http://opennlp.apache.org/download.html
      "Models
       The models for Apache OpenNLP are found here.
       The models can be used for testing or getting started, please train your own models for all other use cases."

  * On LanguageDetectionModel for OpenNLP:
      http://opennlp.apache.org/news/model-langdetect-183.html
      https://www.apache.org/dist/opennlp/models/langdetect/1.8.3/README.txt
  * http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html
  * http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.langdetect.classifying
  * http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.cli.langdetect
  * Good precision etc. on detecting Maori: https://www.apache.org/dist/opennlp/models/langdetect/1.8.3/langdetect-183.bin.report.txt
  * Maybe useful: http://opennlp.sourceforge.net/models-1.5/
      "Use the links in the table below to download the pre-trained models for the OpenNLP 1.5 series."


1. Download OpenNLP from https://opennlp.apache.org/download.html.
I got the binary version and unzipped it.

2. Get the model, langdetect-183.bin, from: http://opennlp.apache.org/models.html
   Direct download link: https://www.apache.org/dyn/closer.cgi/opennlp/models/langdetect/1.8.3/langdetect-183.bin
The page says:
   "All models are zip compressed (like a jar file), they must not be uncompressed."
- So if langdetect-183.bin is renamed to langdetect-183.zip, its contents can be inspected.
- But for use with openNLP, don't rename or unzip it.
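
For a quick check that the model really is just a zip archive, a small Java sketch along these lines should work without renaming the file; the class name InspectModel and the relative path to langdetect-183.bin are only illustrative, not part of this project:

    import java.io.File;
    import java.util.Enumeration;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipFile;

    // Hypothetical helper: lists the entries packed inside the langdetect-183.bin model.
    public class InspectModel {
        public static void main(String[] args) throws Exception {
            try (ZipFile zip = new ZipFile(new File("langdetect-183.bin"))) {
                Enumeration<? extends ZipEntry> entries = zip.entries();
                while (entries.hasMoreElements()) {
                    ZipEntry entry = entries.nextElement();
                    System.out.println(entry.getName() + " (" + entry.getSize() + " bytes)");
                }
            }
        }
    }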

3. UNNECESSARY:
I started by following the instructions at the bottom of:
https://www.apache.org/dist/opennlp/models/langdetect/1.8.3/README.txt
   Note that the svn checkout from the leipzig data link is huge, so I restricted it to just Maori, English, Dutch and French text; see further below.

These instructions proved ultimately unnecessary, as the downloaded openNLP model for language detection, langdetect-183.bin, already covers Maori (along with the other languages), so there was no need to train a model of our own.


Individual Leipzig corpora for just Maori or other specific languages can be downloaded via:
   http://wortschatz.uni-leipzig.de/en/download/

[svn co https://svn.apache.org/repos/bigdata/opennlp/trunk/leipzig]
https://svn.apache.org/repos/bigdata/opennlp/trunk/leipzig/
Here's the svn web view for the leipzig/data language files: https://svn.apache.org/repos/bigdata/opennlp/trunk/leipzig/data/

[Here's the svn web view of apache.org: http://svn.eu.apache.org/viewvc
Unfortunately this covers other Apache projects, including openNLP, but not the leipzig language data files I was after]

    svn co https://svn.apache.org/repos/bigdata/opennlp/trunk --depth immediates
    mv trunk opennlp-corpus
    cd opennlp-corpus/leipzig
    svn up --set-depth immediates
    cd resources/
    wharariki:[197]/Scratch/ak19/openNLP-lang-detect/opennlp-corpus/leipzig/resources>svn up --set-depth infinity
    cd ../data
    svn up mri_web_2011_100K-sentences.txt
    svn up eng_wikipedia_2012_3M-sentences.txt
    svn up nld_mixed_2012_1M-sentences.txt
    svn up fra_mixed_2009_1M-sentences.txt

    cd ..
    # in opennlp-corpus/leipzig
    chmod u+x create_langdetect_model.sh


    cd /Scratch/ak19/openNLP-lang-detect/apache-opennlp-1.9.1
    export OPENNLP_HOME=`pwd`
    ./create_langdetect_model.sh mri_web_2011_100K-sentences.txt
    ./create_langdetect_model.sh nld_mixed_2012_1M-sentences.txt


4. Attempting to run OpenNLP + LanguageDetectorModel's command line tools for language prediction

I couldn't get prediction to work from the command line, so I ended up writing that part as a Java class. But I could run some parts of the command line tools by following the instructions on the pages below.


http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html
http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.langdetect.classifying
http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.cli.langdetect
    "Usage: opennlp LanguageDetector model < documents"

# must have exported OPENNLP_HOME (maybe add its bin to PATH?)
# Following the Usage instruction just above:
    $ cd /Scratch/ak19/openNLP-lang-detect
    $ ./apache-opennlp-1.9.1/bin/opennlp LanguageDetector langdetect-183.bin < opennlp-corpus-downloaded/mri-nz_web_2017_100K/mri-nz_web_2017_100K-words.txt

    [# Sending all output into a file for inspection:
    $ ./apache-opennlp-1.9.1/bin/opennlp LanguageDetector langdetect-183.bin < opennlp-corpus-downloaded/mri-nz_web_2017_100K/mri-nz_web_2017_100K-words.txt > bla.txt 2>&1
    ]



    [# The following didn't work, even though the first opennlp manual link above says the "opennlp LanguageDetector model" command takes its input from stdin. The "<" operator redirects from a file, not from a literal string, so the sentence would have to be saved to a file (or piped in with echo) first:

    ./apache-opennlp-1.9.1/bin/opennlp LanguageDetector langdetect-183.bin < "Ko tēnei te Whare Wānanga o Waikato e whakatau nei i ngā iwi o te ao, ki roto i te riu o te awa e rere nei, ki runga i te whenua e hora nei, ki raro i te taumaru o ngā maunga whakaruru e tau awhi nei."
    ]


5. For writing Java code:
To write the basic code, I followed the Java skeleton examples at
 * http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html
 * http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.langdetect.classifying


To get the imports correct, I searched for example/tutorial Java code on using openNLP with the LanguageDetectorModel:
* Java code, import files:
  https://www.tutorialkart.com/opennlp/language-detector-example-in-apache-opennlp/
* The tutorial link above also covers Java code to train detection of a particular language (a rough sketch of such training code follows below).
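
As a sketch only (not code from this project), training with the 1.9.1 Java API would look something like the following, assuming a training file in the "language<TAB>sentence" per-line format that LanguageDetectorSampleStream expects; the file name my-training-corpus.txt and the output name my-langdetect.bin are made up for illustration:

    import java.io.BufferedOutputStream;
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import java.nio.charset.StandardCharsets;

    import opennlp.tools.langdetect.*;
    import opennlp.tools.util.*;

    public class TrainLangDetect {
        public static void main(String[] args) throws Exception {
            // Each line of the training file: ISO 639-3 code, a tab, then a sentence, e.g. "mri<TAB>Ko tenei te ..."
            InputStreamFactory in = new MarkableFileInputStreamFactory(new File("my-training-corpus.txt"));
            ObjectStream<LanguageSample> samples =
                    new LanguageDetectorSampleStream(new PlainTextByLineStream(in, StandardCharsets.UTF_8));

            // Train with default settings and write the model out in the same .bin form as langdetect-183.bin.
            LanguageDetectorModel model = LanguageDetectorME.train(
                    samples, TrainingParameters.defaultParams(), new LanguageDetectorFactory());
            try (OutputStream out = new BufferedOutputStream(new FileOutputStream("my-langdetect.bin"))) {
                model.serialize(out);
            }
        }
    }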


apache-opennlp-1.9.1/lib/opennlp-tools-1.9.1.jar contains the imports we want, in particular
    import opennlp.tools.langdetect.*;
    import opennlp.tools.util.*;

6. Wrote a very basic form of the MaoriDetector.java class.
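
In outline (a sketch of the idea, not the committed file), the class loads langdetect-183.bin, runs the detector over a hardcoded Maori sentence taken from our uni website (the same one quoted in step 4), and prints the best predicted language, which comes out as mri at a confidence of over 0.6. The model path below is an assumption; point it at wherever langdetect-183.bin was saved:

    import java.io.File;

    import opennlp.tools.langdetect.*;

    public class MaoriDetector {
        public static void main(String[] args) throws Exception {
            // Load the pre-trained language detection model, left zipped as downloaded.
            File modelFile = new File("/Scratch/ak19/openNLP-lang-detect/langdetect-183.bin");
            LanguageDetectorModel model = new LanguageDetectorModel(modelFile);
            LanguageDetector detector = new LanguageDetectorME(model);

            // Hardcoded test input in Maori.
            String text = "Ko tēnei te Whare Wānanga o Waikato e whakatau nei i ngā iwi o te ao, "
                    + "ki roto i te riu o te awa e rere nei, ki runga i te whenua e hora nei, "
                    + "ki raro i te taumaru o ngā maunga whakaruru e tau awhi nei.";

            // Best single prediction: "mri" is expected for this input, with confidence above 0.6.
            Language best = detector.predictLanguage(text);
            System.out.println("Best predicted language: " + best.getLang()
                    + ", confidence: " + best.getConfidence());
        }
    }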

To compile and run:

a. export OPENNLP_HOME=/Scratch/ak19/openNLP-lang-detect/apache-opennlp-1.9.1
b. wharariki:[115]/Scratch/ak19/openNLP-lang-detect/src>javac -cp ".:$OPENNLP_HOME/lib/*" MaoriDetector.java
c. wharariki:[116]/Scratch/ak19/openNLP-lang-detect/src>java -cp ".:$OPENNLP_HOME/lib/*" MaoriDetector


(Though possibly the only jar file needed in $OPENNLP_HOME/lib is opennlp-tools-1.9.1.jar)