------------------
BASIC README
------------------

0. The code, its necessary helper files and libraries, and this README live at:

http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection

You can check it out from svn with:
svn co http://svn.greenstone.org/gs3-extensions/maori-lang-detection

This checkout contains:
- a tarball of the OpenNLP version that was current at the time of the commit; you can also get it from its original site http://opennlp.apache.org/download.html
- the folder "models-trainingdata-and-sampletxts", itself containing:
  - langdetect-183.bin: the LanguageDetectionModel to be used by OpenNLP (again, current at the time of commit); you can also get it from http://opennlp.apache.org/models.html (direct link: https://www.apache.org/dyn/closer.cgi/opennlp/models/langdetect/1.8.3/langdetect-183.bin)
  - mri-sent_trained.bin: our own custom-generated model for Maori sentence detection
  - mri-sent.train: the training text file for generating the mri-sent_trained.bin Maori sentence detection model
  - sample_mri_paragraphs.txt: some text to test the Maori sentence detection model on. Its content is from the Niupepa collection page http://www.nzdl.org/cgi-bin/library?gg=text&e=p-00000-00---off-0niupepa--00-0----0-10-0---0---0direct-10---4-------0-1l--11-en-50---20-about---00-0-1-00-0-0-11-1-0utfZz-8-00&a=p&p=about&l=mi&nw=utf-8
  - sample_maori_shorttext.txt: to test MaoriTextDetector.java with
- the "src" folder for Java classes, currently just MaoriTextDetector.java and its class file. MaoriTextDetector.java uses the aforementioned LanguageDetectionModel, langdetect-183.bin, to detect whether input text from a file or stdin is in Maori or not.
- gen_SentenceDetection_model.sh, our custom script that generates both the mri-sent.train and mri-sent_trained.bin files mentioned above:
  - the script works on OpenNLP's Leipzig corpus of 100k Maori sentences from 2011 to get its sample sentences into the correct format in the mri-sent.train file
  - from this file of training sentences, it then generates the sentence detector model, mri-sent_trained.bin
- mri-opennlp-corpus.tar.gz: a tarball containing the 100k Maori sentences OpenNLP corpus, checked out with svn in its original directory structure from https://svn.apache.org/repos/bigdata/opennlp/trunk/mri_web_2011_100K-sentences.txt
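The two-stage job that gen_SentenceDetection_model.sh automates (format the corpus, then train a model) can be sketched with OpenNLP's stock command line tool. This is an illustrative guess at an equivalent invocation, not the script's actual contents; it assumes mri-sent.train already exists in the current folder and that OPENNLP_HOME has been exported as described in step 3 further below:

```shell
# Sketch: train a sentence detector model from one-sentence-per-line
# training data using OpenNLP's bundled SentenceDetectorTrainer tool.
$OPENNLP_HOME/bin/opennlp SentenceDetectorTrainer \
    -model mri-sent_trained.bin \
    -lang mri \
    -data mri-sent.train \
    -encoding UTF-8
```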


1. Once you've svn checked out maori-lang-detection from gs3-extensions, create a folder called "models" and put the files langdetect-183.bin and mri-sent_trained.bin from the folder "models-trainingdata-and-sampletxts" into it.
(These are just zip files, but they have to keep the .bin extension for OpenNLP to use them. If you ever wish to see the contents of such a .bin file, you can rename it to .zip and use the Archive Manager or another zip tool to inspect the contents.)

You can optionally put the file models-trainingdata-and-sampletxts/


2. Next extract apache-opennlp-1.9.1-bin.tar.gz.
This will create a folder called apache-opennlp-1.9.1. Move the "models" folder you created in step 1 into this folder.


3. Before you can compile or run the MaoriTextDetector program, you always have to prepare the terminal by setting up the environment for OpenNLP as follows:
cd /type/here/path/to/your/extracted/apache-opennlp-1.9.1
export OPENNLP_HOME=`pwd`
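To sanity-check that the environment is set up, you can echo the variable and confirm the tools jar is where the later compile step expects it (the lib path below assumes the stock apache-opennlp-1.9.1 layout):

```shell
# After the export, OPENNLP_HOME should hold the absolute path of the
# extracted folder, and the tools jar should be visible under lib/:
echo "$OPENNLP_HOME"
ls "$OPENNLP_HOME"/lib/opennlp-tools-1.9.1.jar
```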


4. If you want to recompile, go to the checked-out maori-lang-detection folder's "src" subfolder. To compile, make sure you have the JDK 7+ bin folder on your PATH environment variable.
Still in the SAME terminal where you set up the OPENNLP_HOME environment in step 3, you can now run:
maori-lang-detection/src$ javac -cp ".:$OPENNLP_HOME/lib/opennlp-tools-1.9.1.jar" MaoriTextDetector.java


5. To run the MaoriTextDetector program, you will need the JDK or JRE 7+ bin folder on your PATH. Still in the SAME terminal where you set up the OPENNLP_HOME environment in step 3, type one of the following:
maori-lang-detection/src$ java -cp ".:$OPENNLP_HOME/lib/*" MaoriTextDetector --help
(prints the usage, including other options)

maori-lang-detection/src$ java -cp ".:$OPENNLP_HOME/lib/*" MaoriTextDetector --file <full/path/to/textfile>

maori-lang-detection/src$ java -cp ".:$OPENNLP_HOME/lib/*" MaoriTextDetector -
which expects text to stream in from standard input.
If entering text manually, remember to press Ctrl-D to signal the usual end of stdin.
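Since the `-` mode reads standard input, you can also pipe or redirect text into it instead of typing it interactively. A sketch, assuming the class has been compiled as in step 4 and you are in the src folder (the sample file path is taken from the checkout contents listed above):

```shell
# Pipe a sentence straight into the detector instead of typing it:
echo "Ko tenei te Whare Wananga o Waikato" | java -cp ".:$OPENNLP_HOME/lib/*" MaoriTextDetector -

# Or redirect a whole file via stdin:
java -cp ".:$OPENNLP_HOME/lib/*" MaoriTextDetector - < ../models-trainingdata-and-sampletxts/sample_maori_shorttext.txt
```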


For links to background reading material, see the OLD README section further below.


NOTE: The OpenNLP language detection model can detect non-macronised Māori text too,
but, as anticipated, the same text produces a lower confidence level for the language prediction. Compare:

$maori-lang-detection/src>java -cp ".:$OPENNLP_HOME/lib/opennlp-tools-1.9.1.jar" MaoriTextDetector -
Waiting to read text from STDIN... (press Ctrl-D when done entering text)>
Ko tenei te Whare Wananga o Waikato e whakatau nei i nga iwi o te ao, ki roto i te riu o te awa e rere nei, ki runga i te whenua e hora nei, ki raro i te taumaru o nga maunga whakaruru e tau awhi nei.
Best language: mri
Best language confidence: 0.5959533972070814
Exitting program with returnVal 0...

$maori-lang-detection/src>java -cp ".:$OPENNLP_HOME/lib/opennlp-tools-1.9.1.jar" MaoriTextDetector -
Waiting to read text from STDIN... (press Ctrl-D when done entering text)>
Ko tēnei te Whare Wānanga o Waikato e whakatau nei i ngā iwi o te ao, ki roto i te riu o te awa e rere nei, ki runga i te whenua e hora nei, ki raro i te taumaru o ngā maunga whakaruru e tau awhi nei.
Best language: mri
Best language confidence: 0.6825737450092515
Exitting program with returnVal 0...

-------------------------
OLD README
-------------------------


http://opennlp.apache.org/news/model-langdetect-183.html
Language Detector Model for Apache OpenNLP released

0. Can we make it run?
Can we detect it with Eng, Fr, NL docs?

1. OpenNLP - is Maori included?
If not, learn how to teach it to recognise Maori:
how you run the training bit of their software and add in a Maori language training set.

ANSWER: Yes, Maori is included, see https://www.apache.org/dist/opennlp/models/langdetect/1.8.3/README.txt

2. Add in a small program to detect particularly Maori.

3. Macron and non-macron input text language recognition support?

-----------------------------------------

READING:
General:
* https://stackoverflow.com/questions/7670427/how-does-language-detection-work
* https://github.com/andreasjansson/language-detection.el
* https://en.wikipedia.org/wiki/ISO_639-3

Specific:
* OpenNLP download: http://opennlp.apache.org/download.html
  "Models
  The models for Apache OpenNLP are found here.
  The models can be used for testing or getting started, please train your own models for all other use cases."
* On the LanguageDetectionModel for OpenNLP:
  http://opennlp.apache.org/news/model-langdetect-183.html
  https://www.apache.org/dist/opennlp/models/langdetect/1.8.3/README.txt
* http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html
* http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.langdetect.classifying
* http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.cli.langdetect
* Good precision etc. on detecting Maori: https://www.apache.org/dist/opennlp/models/langdetect/1.8.3/langdetect-183.bin.report.txt
* Maybe useful: http://opennlp.sourceforge.net/models-1.5/
  "Use the links in the table below to download the pre-trained models for the OpenNLP 1.5 series."


1. Download OpenNLP from https://opennlp.apache.org/download.html.
I got the binary version. I unzipped it.

2. Get the model, langdetect-183.bin, from http://opennlp.apache.org/models.html
Direct download link: https://www.apache.org/dyn/closer.cgi/opennlp/models/langdetect/1.8.3/langdetect-183.bin
The page says:
"All models are zip compressed (like a jar file), they must not be uncompressed."
- So when langdetect-183.bin is renamed to langdetect-183.zip, you can inspect the contents.
- But for use with OpenNLP, don't rename or unzip it.
|
---|


3. UNNECESSARY:
I started by following the instructions at the bottom of:
https://www.apache.org/dist/opennlp/models/langdetect/1.8.3/README.txt
Note that the svn checkout from the Leipzig data link is too huge, so I restricted it to just Maori, English, Dutch and French text, see further below.

These instructions proved ultimately unnecessary, as the downloaded OpenNLP model for language detection, langdetect-183.bin, already contains all that.

Individual downloads of Leipzig corpora for just Maori or specific languages can be found via:
http://wortschatz.uni-leipzig.de/en/download/

[svn co https://svn.apache.org/repos/bigdata/opennlp/trunk/leipzig]
https://svn.apache.org/repos/bigdata/opennlp/trunk/leipzig/
Here's the svn web view for the leipzig/data language files: https://svn.apache.org/repos/bigdata/opennlp/trunk/leipzig/data/

[Here's the svn web view of apache.org: http://svn.eu.apache.org/viewvc
Unfortunately this has other apache stuff, including OpenNLP, but not the Leipzig language data files I was after]


svn co --depth immediates --trust-server-cert --non-interactive https://svn.apache.org/repos/bigdata/opennlp/trunk opennlp-corpus
cd opennlp-corpus
svn up --set-depth immediates --trust-server-cert --non-interactive
cd leipzig
svn up --set-depth immediates --trust-server-cert --non-interactive
cd resources/
svn up --set-depth infinity --trust-server-cert --non-interactive
cd ../data
svn up --trust-server-cert --non-interactive mri_web_2011_100K-sentences.txt

(# UNNECESSARY TO DOWNLOAD:
svn up eng_wikipedia_2012_3M-sentences.txt
svn up nld_mixed_2012_1M-sentences.txt
svn up fra_mixed_2009_1M-sentences.txt
)

cd ..
# in opennlp-corpus/leipzig
chmod u+x create_langdetect_model.sh

cd /Scratch/ak19/openNLP-lang-detect/apache-opennlp-1.9.1
export OPENNLP_HOME=`pwd`
./create_langdetect_model.sh mri_web_2011_100K-sentences.txt
./create_langdetect_model.sh nld_mixed_2012_1M-sentences.txt


4. Attempting to run OpenNLP + LanguageDetectorModel's command line tools for language prediction

I couldn't get prediction to work from the command line, so I ended up writing that part as a Java class. But I could run some part of the command line tools with the instructions on the following pages.

http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html
http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.langdetect.classifying
http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.cli.langdetect
"Usage: opennlp LanguageDetector model < documents"

# must have exported OPENNLP_HOME (maybe add its bin to PATH?)
# Following the Usage instruction just above:
$ cd /Scratch/ak19/openNLP-lang-detect
$ ./apache-opennlp-1.9.1/bin/opennlp LanguageDetector langdetect-183.bin < opennlp-corpus-downloaded/mri-nz_web_2017_100K/mri-nz_web_2017_100K-words.txt

[# Sending all output into a file for inspection:
$ ./apache-opennlp-1.9.1/bin/opennlp LanguageDetector langdetect-183.bin < opennlp-corpus-downloaded/mri-nz_web_2017_100K/mri-nz_web_2017_100K-words.txt > bla.txt 2>&1
]

[# Didn't work, even though the first opennlp manual link above said that the "opennlp LangDetector model" command should take input from stdin. Note that `<` redirects from a file, not from a literal string, so a here-string (`<<<`) or an echo pipe would be needed for the following:

./apache-opennlp-1.9.1/bin/opennlp LanguageDetector langdetect-183.bin < "Ko tēnei te Whare Wānanga o Waikato e whakatau nei i ngā iwi o te ao, ki roto i te riu o te awa e rere nei, ki runga i te whenua e hora nei, ki raro i te taumaru o ngā maunga whakaruru e tau awhi nei."
]

UPDATE 24 July 2019: The following worked:
>$OPENNLP_HOME/bin/opennlp LanguageDetector $OPENNLP_HOME/models/langdetect-183.bin < opennlp-corpus/leipzig/data/mri_web_2011_100K-sentences.txt
but it doesn't provide predictions. Not sure I understand what it did other than print the contents of the *sentences.txt file and end with:
...
Average: 0.1 doc/s
Total: 1 doc
Runtime: 8.046s
Execution time: 8.719 seconds

5. For writing Java code:
To write the basic code, I followed the Java skeleton examples at
* http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html
* http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.langdetect.classifying

To get the imports correct, I searched for example/tutorial Java code on using OpenNLP with the LanguageDetectorModel:
* Java code, import files:
  https://www.tutorialkart.com/opennlp/language-detector-example-in-apache-opennlp/
* The tutorial link above also covers Java code to train detecting a particular language.

apache-opennlp-1.9.1/lib/opennlp-tools-1.9.1.jar contains the imports we want, in particular:
import opennlp.tools.langdetect.*;
import opennlp.tools.util.*;

6. Wrote the very basic form of the MaoriDetector.java class.

To compile and run:

a. export OPENNLP_HOME=/Scratch/ak19/openNLP-lang-detect/apache-opennlp-1.9.1
b. wharariki:[115]/Scratch/ak19/openNLP-lang-detect/src>javac -cp ".:$OPENNLP_HOME/lib/*" MaoriDetector.java
c. wharariki:[116]/Scratch/ak19/openNLP-lang-detect/src>java -cp ".:$OPENNLP_HOME/lib/*" MaoriDetector

(Though possibly the only jar file needed in $OPENNLP_HOME/lib is opennlp-tools-1.9.1.jar)

================

UNORGANISED

https://www.slideshare.net/shuyo/short-text-language-detection-with-infinitygram-12949447
http://alexott.blogspot.com/2017/10/evaluating-fasttexts-models-for.html

http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.cli.sentdetect

SentenceDetectorTrainer example:
https://www.tutorialkart.com/opennlp/train-model-sentence-detection-java/

https://stackoverflow.com/questions/36516363/sentence-detection-with-opennlp