source: gs3-extensions/maori-lang-detection/README.txt@ 33350

Last change on this file since 33350 was 33350, checked in by ak19, 5 years ago

Better comments. Tested macronised vs unmacronised Māori language test string and both are detected as mri, but the unmacronised is detected with lower confidence. Added a note on that in the README.

------------------
BASIC README
------------------
0. The code, its necessary helper files and libraries, and this README live at:
    http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection

You can check it out from svn with:
    svn co http://svn.greenstone.org/gs3-extensions/maori-lang-detection

- It contains the OpenNLP release that was current at the time of the commit, but you can also get it from its original site http://opennlp.apache.org/download.html
- The LanguageDetectionModel to be used by OpenNLP is also included (again, current at the time of commit), but you can get it from http://opennlp.apache.org/models.html (direct link: https://www.apache.org/dyn/closer.cgi/opennlp/models/langdetect/1.8.3/langdetect-183.bin)

1. Once you've checked it out with svn, create a folder called "models" and put the langdetect-183.bin file into it.
(This is just a zip file, but it has to keep the .bin extension for OpenNLP to use it. If you ever wish to see its contents, you can rename it to .zip and use the Archive Manager or another zip tool to inspect it.)
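
As a quick sanity check that the downloaded model really is a zip archive, the sketch below tests for the "PK" signature that zip files start with. The function name is_zip_model is made up for this illustration; it is not part of OpenNLP or of this extension.

```shell
# is_zip_model FILE
# Succeeds if FILE exists and starts with the two bytes "PK", the
# signature that begins a zip archive.
# (Hypothetical helper, for illustration only.)
is_zip_model() {
    [ -f "$1" ] || return 1
    [ "$(head -c 2 "$1")" = "PK" ]
}

# Example (assumes the file sits in ./models as described in step 1):
#   is_zip_model models/langdetect-183.bin && echo "looks like a zip archive"
```

If unzip is installed, "unzip -l models/langdetect-183.bin" should also list the contents without any renaming, since unzip does not depend on the file extension.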

2. Next, extract apache-opennlp-1.9.1-bin.tar.gz.
This will create a folder called apache-opennlp-1.9.1. Move the "models" folder you created in step 1 into this folder.

3. Before you can compile or run the MaoriTextDetector program, you always have to prepare a terminal by setting up the environment for OpenNLP as follows:
    cd /type/here/path/to/your/extracted/apache-opennlp-1.9.1
    export OPENNLP_HOME=`pwd`
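
Before moving on, it can help to confirm that the folder layout from steps 1 to 3 is in place. The following is only a sketch: check_opennlp_setup is a made-up helper name, and the two paths it checks are simply the ones this README mentions (the opennlp-tools jar under lib, and the model inside the "models" folder moved into the OpenNLP folder).

```shell
# check_opennlp_setup [DIR]
# Checks that DIR (default: $OPENNLP_HOME) contains the files this README
# refers to. Prints what is missing and returns non-zero on failure.
# (Hypothetical helper, for illustration only.)
check_opennlp_setup() {
    home="${1:-${OPENNLP_HOME:-}}"
    if [ -z "$home" ]; then
        echo "OPENNLP_HOME is not set" >&2
        return 1
    fi
    status=0
    for path in "lib/opennlp-tools-1.9.1.jar" "models/langdetect-183.bin"; do
        if [ ! -f "$home/$path" ]; then
            echo "missing: $home/$path" >&2
            status=1
        fi
    done
    return $status
}

# Example, in the terminal prepared in step 3:
#   check_opennlp_setup && echo "OpenNLP environment looks ready"
```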


4. If you want to recompile, go into the "src" subfolder of the checked-out maori-lang-detection folder. To compile, make sure you have the JDK 7+ bin folder on your PATH environment variable.
Still in the SAME terminal where you set up the OPENNLP_HOME environment in step 3, you can now run:
    maori-lang-detection/src$ javac -cp ".:$OPENNLP_HOME/lib/opennlp-tools-1.9.1.jar" MaoriTextDetector.java

5. To run the MaoriTextDetector program, you will need the JDK or JRE 7+ bin folder on your PATH. Still in the SAME terminal where you set up the OPENNLP_HOME environment in step 3, type one of the following:
    maori-lang-detection/src$ java -cp ".:$OPENNLP_HOME/lib/*" MaoriTextDetector --help
    (prints the usage, including other options)

    maori-lang-detection/src$ java -cp ".:$OPENNLP_HOME/lib/*" MaoriTextDetector --file <full/path/to/textfile>

    maori-lang-detection/src$ java -cp ".:$OPENNLP_HOME/lib/*" MaoriTextDetector -
    which expects text to stream in from standard input.
    If entering text manually, remember to press Ctrl-D to signal the end of standard input.
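
Text does not have to be typed in by hand for the "-" option; it can also be piped in, in which case no Ctrl-D is needed. A minimal sketch, where detect_text is a made-up wrapper name and the java command line is the one from step 5:

```shell
# detect_text TEXT
# Sends TEXT to MaoriTextDetector on standard input, using the "-" option
# from step 5. Run it from maori-lang-detection/src with OPENNLP_HOME set
# as in step 3. (Hypothetical wrapper, for illustration only.)
detect_text() {
    printf '%s\n' "$1" | java -cp ".:$OPENNLP_HOME/lib/*" MaoriTextDetector -
}

# Examples:
#   detect_text "Ko tēnei te Whare Wānanga o Waikato"
#   java -cp ".:$OPENNLP_HOME/lib/*" MaoriTextDetector - < textfile
```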


For links to background reading materials, see the OLD README section further below.


NOTE: The OpenNLP Language Detection Model can detect non-macronised Māori text too,
but as anticipated, the same text produces a lower confidence level for the language prediction. Compare:

$maori-lang-detection/src>java -cp ".:$OPENNLP_HOME/lib/opennlp-tools-1.9.1.jar" MaoriTextDetector -
    Waiting to read text from STDIN... (press Ctrl-D when done entering text)>
    Ko tenei te Whare Wananga o Waikato e whakatau nei i nga iwi o te ao, ki roto i te riu o te awa e rere nei, ki runga i te whenua e hora nei, ki raro i te taumaru o nga maunga whakaruru e tau awhi nei.
    Best language: mri
    Best language confidence: 0.5959533972070814
    Exitting program with returnVal 0...

$maori-lang-detection/src>java -cp ".:$OPENNLP_HOME/lib/opennlp-tools-1.9.1.jar" MaoriTextDetector -
    Waiting to read text from STDIN... (press Ctrl-D when done entering text)>
    Ko tēnei te Whare Wānanga o Waikato e whakatau nei i ngā iwi o te ao, ki roto i te riu o te awa e rere nei, ki runga i te whenua e hora nei, ki raro i te taumaru o ngā maunga whakaruru e tau awhi nei.
    Best language: mri
    Best language confidence: 0.6825737450092515
    Exitting program with returnVal 0...
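
To repeat this comparison with other text, or to act on the reported confidence in a script, something like the sketch below could be used. Both function names are made up for this illustration, the parsing relies only on the two "Best language" output lines shown above, and the 0.6 cut-off in the example is an arbitrary value chosen to fall between the two confidences above. The program may offer its own options for this; check --help first.

```shell
# strip_macrons
# Copies stdin to stdout with macronised vowels replaced by plain ones,
# e.g. to build an unmacronised version of a test string.
strip_macrons() {
    sed -e 's/ā/a/g' -e 's/ē/e/g' -e 's/ī/i/g' -e 's/ō/o/g' -e 's/ū/u/g' \
        -e 's/Ā/A/g' -e 's/Ē/E/g' -e 's/Ī/I/g' -e 's/Ō/O/g' -e 's/Ū/U/g'
}

# is_confident_maori MIN
# Reads MaoriTextDetector output on stdin and succeeds only if the best
# language is mri and its confidence is at least MIN.
is_confident_maori() {
    awk -v min="$1" '
        /Best language:/            { lang = $NF }
        /Best language confidence:/ { conf = $NF }
        END { exit !(lang == "mri" && conf + 0 >= min + 0) }
    '
}

# Example, from maori-lang-detection/src with OPENNLP_HOME set:
#   strip_macrons < textfile \
#       | java -cp ".:$OPENNLP_HOME/lib/*" MaoriTextDetector - \
#       | is_confident_maori 0.6 && echo "confidently Maori"
```

With a cut-off of 0.6, the macronised sample above would pass and the unmacronised one would not, so any threshold needs to allow for unmacronised input scoring lower.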


-------------------------
OLD README
-------------------------


http://opennlp.apache.org/news/model-langdetect-183.html
Language Detector Model for Apache OpenNLP released

0. Can we make it run?
Can we detect it with Eng, Fr, NL docs?

1. openNLP - is Maori included?
If not, learn how to teach it to recognise Maori:
how to run the training part of their software and add in a Maori language training set.

ANSWER: Yes, Maori is included, see https://www.apache.org/dist/opennlp/models/langdetect/1.8.3/README.txt

2. Add in a small program to detect Maori in particular

3. Macron and non-macron input text language recognition support?

-----------------------------------------

READING:
    General:
    * https://stackoverflow.com/questions/7670427/how-does-language-detection-work
    * https://github.com/andreasjansson/language-detection.el
    * https://en.wikipedia.org/wiki/ISO_639-3

    Specific:
    * openNLP download: http://opennlp.apache.org/download.html
        "Models
        The models for Apache OpenNLP are found here.
        The models can be used for testing or getting started, please train your own models for all other use cases."

    * On LanguageDetectionModel for OpenNLP:
        http://opennlp.apache.org/news/model-langdetect-183.html
        https://www.apache.org/dist/opennlp/models/langdetect/1.8.3/README.txt
    * http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html
    * http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.langdetect.classifying
    * http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.cli.langdetect
    * Good precision etc on detecting Maori: https://www.apache.org/dist/opennlp/models/langdetect/1.8.3/langdetect-183.bin.report.txt
    * Maybe useful: http://opennlp.sourceforge.net/models-1.5/
        "Use the links in the table below to download the pre-trained models for the OpenNLP 1.5 series."



1. Download OpenNLP from https://opennlp.apache.org/download.html.
I got the binary version. I unzipped it.

2. Get the model, langdetect-183.bin, from: http://opennlp.apache.org/models.html
    Direct download link: https://www.apache.org/dyn/closer.cgi/opennlp/models/langdetect/1.8.3/langdetect-183.bin
The page says:
    "All models are zip compressed (like a jar file), they must not be uncompressed."
- So when langdetect-183.bin is renamed to langdetect-183.zip, its contents can be inspected.
- But for use with openNLP, don't rename or unzip it.

3. UNNECESSARY:
I started by following the instructions at the bottom of:
https://www.apache.org/dist/opennlp/models/langdetect/1.8.3/README.txt
    Note that the svn checkout from the leipzig data link is too huge. So I restricted it to just Maori, English, Dutch and French text, see further below.

These instructions proved ultimately unnecessary, as the downloaded openNLP model for language detection, langdetect-183.bin, already contains all that


Individual downloads of the Leipzig corpora, for just Maori or other specific languages, can be found via:
    http://wortschatz.uni-leipzig.de/en/download/

[svn co https://svn.apache.org/repos/bigdata/opennlp/trunk/leipzig]
https://svn.apache.org/repos/bigdata/opennlp/trunk/leipzig/
Here's the svn web view for the leipzig/data language files: https://svn.apache.org/repos/bigdata/opennlp/trunk/leipzig/data/

[Here's the svn web view of apache.org: http://svn.eu.apache.org/viewvc
Unfortunately this is other apache stuff, including openNLP, but not the leipzig language data files I was after]


    svn co https://svn.apache.org/repos/bigdata/opennlp/trunk --depth immediates
    mv trunk opennlp-corpus
    svn up --set-depth immediates
    cd resources/
    wharariki:[197]/Scratch/ak19/openNLP-lang-detect/opennlp-corpus/leipzig/resources>svn up --set-depth infinity
    cd ../data
    svn up mri_web_2011_100K-sentences.txt
    svn up eng_wikipedia_2012_3M-sentences.txt
    svn up nld_mixed_2012_1M-sentences.txt
    svn up fra_mixed_2009_1M-sentences.txt

    cd ..
    # in opennlp-corpus/leipzig
    chmod u+x create_langdetect_model.sh


    cd /Scratch/ak19/openNLP-lang-detect/apache-opennlp-1.9.1
    export OPENNLP_HOME=`pwd`
    ./create_langdetect_model.sh mri_web_2011_100K-sentences.txt
    ./create_langdetect_model.sh nld_mixed_2012_1M-sentences.txt


4. Attempting to run the command line tools of OpenNLP + LanguageDetectorModel for language prediction

I couldn't get prediction to work from the command line, so I ended up writing that part as a Java class. But I could run some of the command line tools by following the instructions on these pages:


http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html
http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.langdetect.classifying
http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.cli.langdetect
    "Usage: opennlp LanguageDetector model < documents"

# must have exported OPENNLP_HOME (maybe add its bin to PATH?)
# Following the Usage instruction just above:
    $ cd /Scratch/ak19/openNLP-lang-detect
    $ ./apache-opennlp-1.9.1/bin/opennlp LanguageDetector langdetect-183.bin < opennlp-corpus-downloaded/mri-nz_web_2017_100K/mri-nz_web_2017_100K-words.txt

    [# Sending all output into a file for inspection:
    $ ./apache-opennlp-1.9.1/bin/opennlp LanguageDetector langdetect-183.bin < opennlp-corpus-downloaded/mri-nz_web_2017_100K/mri-nz_web_2017_100K-words.txt > bla.txt 2>&1
    ]



    [# Didn't work, even though the first opennlp manual link above said that the "opennlp LanguageDetector model" command should take input from stdin:

    ./apache-opennlp-1.9.1/bin/opennlp LanguageDetector langdetect-183.bin < "Ko tēnei te Whare Wānanga o Waikato e whakatau nei i ngā iwi o te ao, ki roto i te riu o te awa e rere nei, ki runga i te whenua e hora nei, ki raro i te taumaru o ngā maunga whakaruru e tau awhi nei."
    ]
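
A likely reason the last command above did not work: in the shell, "<" redirects standard input from a file, so the quoted sentence is treated as a file name rather than as the text to classify. To feed a literal string to stdin, it can be piped in instead. The sketch below shows only the shell side; opennlp_detect_string is a made-up wrapper name, and whether the opennlp tool then prints a useful prediction for the piped line was not verified here.

```shell
# opennlp_detect_string MODEL TEXT
# Pipes TEXT into "opennlp LanguageDetector MODEL" on standard input,
# following the Usage line quoted above. Requires OPENNLP_HOME to be
# exported. (Hypothetical wrapper, for illustration only.)
opennlp_detect_string() {
    printf '%s\n' "$2" | "$OPENNLP_HOME/bin/opennlp" LanguageDetector "$1"
}

# Example:
#   opennlp_detect_string langdetect-183.bin "Ko tēnei te Whare Wānanga o Waikato"
```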


5. For writing Java code:
To write the basic code, I followed the Java skeleton examples at
    * http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html
    * http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.langdetect.classifying


To get the imports correct, I searched for example/tutorial Java code on using openNLP with the LanguageDetectorModel
* Java code: Import files:
https://www.tutorialkart.com/opennlp/language-detector-example-in-apache-opennlp/
* The tutorial link above also covers Java code for training the detection of a particular language.


apache-opennlp-1.9.1/lib/opennlp-tools-1.9.1.jar contains the imports we want, in particular
    import opennlp.tools.langdetect.*;
    import opennlp.tools.util.*;

6. Wrote the very basic form of the MaoriDetector.java class.

To compile and run:

a. export OPENNLP_HOME=/Scratch/ak19/openNLP-lang-detect/apache-opennlp-1.9.1
b. wharariki:[115]/Scratch/ak19/openNLP-lang-detect/src>javac -cp ".:$OPENNLP_HOME/lib/*" MaoriDetector.java
c. wharariki:[116]/Scratch/ak19/openNLP-lang-detect/src>java -cp ".:$OPENNLP_HOME/lib/*" MaoriDetector


(Though possibly the only jar file needed in $OPENNLP_HOME/lib is opennlp-tools-1.9.1.jar)