source: gs3-extensions/maori-lang-detection/README.txt@ 33357

Last change on this file since 33357 was 33357, checked in by ak19, 5 years ago

Minor changes

------------------
BASIC README
------------------
0. The code and its necessary helper files and libraries, and this README, live at:
    http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection

You can check it out from svn with:
    svn co http://svn.greenstone.org/gs3-extensions/maori-lang-detection

This checkout contains:
- a tarball of the OpenNLP version that was current at the time of the commit; you can also get it from its original site, http://opennlp.apache.org/download.html
- the folder "models-trainingdata-and-sampletxts", itself containing:
    - langdetect-183.bin: the LanguageDetectionModel to be used by OpenNLP (again, current at the time of commit). You can also get it from http://opennlp.apache.org/models.html (direct link: https://www.apache.org/dyn/closer.cgi/opennlp/models/langdetect/1.8.3/langdetect-183.bin)
    - mri-sent_trained.bin: our own custom-generated model for Maori Sentence Detection
    - mri-sent.train: the training text file for generating the mri-sent_trained.bin Maori Sentence Detection model
    - sample_mri_paragraphs.txt: contains some text to test the Maori Sentence Detection model on. Its content is from the Niupepa collection page http://www.nzdl.org/cgi-bin/library?gg=text&e=p-00000-00---off-0niupepa--00-0----0-10-0---0---0direct-10---4-------0-1l--11-en-50---20-about---00-0-1-00-0-0-11-1-0utfZz-8-00&a=p&p=about&l=mi&nw=utf-8
    - sample_maori_shorttext.txt: text to test MaoriTextDetector.java with
- the "src" folder for Java classes, currently just MaoriTextDetector.java and its class file. MaoriTextDetector.java uses the aforementioned LanguageDetectionModel, langdetect-183.bin, to detect whether input text from a file or stdin is in Maori or not
- gen_SentenceDetection_model.sh, our custom script that generates both the mri-sent.train file and the mri-sent_trained.bin model mentioned above
    - the script works on opennlp's leipzig corpus of 100k Maori sentences from 2011 to get its sample sentences into the correct format in the mri-sent.train file
    - from this file containing training sentences, it generates the Sentence Detector Model, mri-sent_trained.bin
- mri-opennlp-corpus.tar.gz: a tarball containing the opennlp corpus of 100k Maori sentences, checked out with svn in its original directory structure from https://svn.apache.org/repos/bigdata/opennlp/trunk/mri_web_2011_100K-sentences.txt


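The reformatting step that gen_SentenceDetection_model.sh performs on the corpus can be pictured with the rough Java sketch below. This is NOT the script's actual code. It assumes (an assumption, not confirmed by this README) that each line of the Leipzig sentences file has the form "<index><TAB><sentence>", and that the training file wants one sentence per line. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LeipzigToTrainingLines {

    // Assumption: each corpus line looks like "<index><TAB><sentence>".
    // Returns just the sentences, one per list element, skipping blank lines.
    public static List<String> toTrainingLines(List<String> corpusLines) {
        List<String> sentences = new ArrayList<>();
        for (String line : corpusLines) {
            int tab = line.indexOf('\t');
            // If there is no tab, treat the whole line as the sentence.
            String sentence = (tab >= 0 ? line.substring(tab + 1) : line).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Hypothetical sample lines in the assumed corpus layout.
        List<String> sample = Arrays.asList(
            "1\tKo te reo te mauri o te mana Maori.",
            "2\tKia ora koutou katoa.");
        for (String s : toTrainingLines(sample)) {
            System.out.println(s);
        }
    }
}
```

In practice the real script remains the reference for how mri-sent.train is produced; the sketch only shows the kind of line-by-line clean-up involved.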
1. Once you've checked out maori-lang-detection from gs3-extensions with svn, create a folder called "models" and put the files langdetect-183.bin and mri-sent_trained.bin from the folder "models-trainingdata-and-sampletxts" into it.
(These are just zip files, but they have to keep the .bin extension in order for OpenNLP to use them. If you ever wish to see the contents of such a .bin file, you can rename a copy to .zip and use the Archive Manager or another zip tool to inspect it.)

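Since the .bin models are ordinary zip archives, their contents can also be listed without renaming anything. The following is a minimal sketch using only the standard java.util.zip classes; the class name ModelBinInspector is made up for this example and is not part of the checkout.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ModelBinInspector {

    // Opens the given file as a zip archive and returns the names of its entries.
    // The file itself is only read, never renamed or modified.
    public static List<String> listEntries(String path) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipFile zip = new ZipFile(path)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                names.add(entries.nextElement().getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: java ModelBinInspector <path/to/model.bin>");
            return;
        }
        for (String name : listEntries(args[0])) {
            System.out.println(name);
        }
    }
}
```

For example, after compiling it, "java ModelBinInspector models/langdetect-183.bin" should print one entry name per line.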
You can optionally put the file models-trainingdata-and-sampletxts/


2. Next, extract apache-opennlp-1.9.1-bin.tar.gz.
This will create a folder called apache-opennlp-1.9.1. Move the "models" folder you created in step 1 into this folder.


3. Before you can compile or run the MaoriTextDetector program, you always have to prepare a terminal by setting up the environment for OpenNLP as follows:
    cd /type/here/path/to/your/extracted/apache-opennlp-1.9.1
    export OPENNLP_HOME=`pwd`


4. If you want to recompile, go to the "src" subfolder of the checked-out maori-lang-detection folder. To compile, make sure the bin folder of a JDK (version 7 or later) is on your PATH environment variable.
Still in the SAME terminal in which you set up the OPENNLP_HOME environment in step 3, you can now run:
    maori-lang-detection/src$ javac -cp ".:$OPENNLP_HOME/lib/opennlp-tools-1.9.1.jar" MaoriTextDetector.java


5. To run the MaoriTextDetector program, you will need the bin folder of a JDK or JRE (version 7 or later) on your PATH. Still in the SAME terminal in which you set up the OPENNLP_HOME environment in step 3, type one of the following:
    maori-lang-detection/src$ java -cp ".:$OPENNLP_HOME/lib/*" MaoriTextDetector --help
    (prints the usage, including other options)

    maori-lang-detection/src$ java -cp ".:$OPENNLP_HOME/lib/*" MaoriTextDetector --file <full/path/to/textfile>

    maori-lang-detection/src$ java -cp ".:$OPENNLP_HOME/lib/*" MaoriTextDetector -
    which expects text to stream in from standard input.
    If entering text manually, remember to press Ctrl-D to signal the end of standard input.


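For readers curious how a program can consume text from standard input until Ctrl-D (end of input), here is a minimal sketch. It is not taken from MaoriTextDetector.java; the class and method names are hypothetical. It decodes the input explicitly as UTF-8, on the assumption that the text is UTF-8 encoded, so that macronised vowels survive regardless of the platform's default encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class StdinTextReader {

    // Reads every line from the stream until end of input (Ctrl-D on a terminal)
    // and returns the text with each line terminated by a newline.
    public static String readAll(InputStream in) throws IOException {
        StringBuilder text = new StringBuilder();
        BufferedReader reader =
            new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        String line;
        while ((line = reader.readLine()) != null) {
            text.append(line).append('\n');
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        System.err.println("Waiting to read text from STDIN... (press Ctrl-D when done)");
        String text = readAll(System.in);
        System.out.println("Read " + text.length() + " characters.");
    }
}
```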
For links to background reading materials, see the OLD README section further below.


NOTE: The OpenNLP Language Detection Model can detect non-macronised Māori text too,
but, as anticipated, the same text without macrons produces a lower confidence level for the language prediction (about 0.60 versus 0.68 in the example below). Compare:

$maori-lang-detection/src>java -cp ".:$OPENNLP_HOME/lib/opennlp-tools-1.9.1.jar" MaoriTextDetector -
    Waiting to read text from STDIN... (press Ctrl-D when done entering text)>
    Ko tenei te Whare Wananga o Waikato e whakatau nei i nga iwi o te ao, ki roto i te riu o te awa e rere nei, ki runga i te whenua e hora nei, ki raro i te taumaru o nga maunga whakaruru e tau awhi nei.
    Best language: mri
    Best language confidence: 0.5959533972070814
    Exitting program with returnVal 0...

$maori-lang-detection/src>java -cp ".:$OPENNLP_HOME/lib/opennlp-tools-1.9.1.jar" MaoriTextDetector -
    Waiting to read text from STDIN... (press Ctrl-D when done entering text)>
    Ko tēnei te Whare Wānanga o Waikato e whakatau nei i ngā iwi o te ao, ki roto i te riu o te awa e rere nei, ki runga i te whenua e hora nei, ki raro i te taumaru o ngā maunga whakaruru e tau awhi nei.
    Best language: mri
    Best language confidence: 0.6825737450092515
    Exitting program with returnVal 0...


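To repeat this kind of comparison on other text, a non-macronised variant can be derived from the macronised original. The sketch below is one possible way, using java.text.Normalizer from the standard library; the class name MacronStripper is hypothetical and is not part of MaoriTextDetector. It decomposes each character, drops the combining macron (U+0304) and recomposes the rest.

```java
import java.text.Normalizer;

public class MacronStripper {

    // Removes macrons from vowels: decompose to base letter + combining mark,
    // delete the combining macron U+0304, then recompose what remains.
    public static String stripMacrons(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        String withoutMacrons = decomposed.replace("\u0304", "");
        return Normalizer.normalize(withoutMacrons, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // \u0113 is e-macron and \u0101 is a-macron, written as escapes so the
        // source file compiles the same way under any file encoding.
        String macronised = "Ko t\u0113nei te Whare W\u0101nanga o Waikato";
        System.out.println(stripMacrons(macronised));
    }
}
```

Both versions of the text could then be fed to MaoriTextDetector in turn to compare the reported confidence values.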
-------------------------
        OLD README
-------------------------


http://opennlp.apache.org/news/model-langdetect-183.html
Language Detector Model for Apache OpenNLP released

0. Can we make it run?
Can we detect it with Eng, Fr, NL docs?

1. openNLP - is Maori included?
If not, learn how to teach it to recognise Maori:
how to run the training part of their software and add in a Maori language training set.

ANSWER: Yes, Maori is included, see https://www.apache.org/dist/opennlp/models/langdetect/1.8.3/README.txt

2. Add in a small program to detect Maori in particular

3. Macron and non-macron input text language recognition support?

-----------------------------------------

READING:
    General:
    * https://stackoverflow.com/questions/7670427/how-does-language-detection-work
    * https://github.com/andreasjansson/language-detection.el
    * https://en.wikipedia.org/wiki/ISO_639-3

    Specific:
    * openNLP download: http://opennlp.apache.org/download.html
        "Models
        The models for Apache OpenNLP are found here.
        The models can be used for testing or getting started, please train your own models for all other use cases."

    * On LanguageDetectionModel for OpenNLP:
        http://opennlp.apache.org/news/model-langdetect-183.html
        https://www.apache.org/dist/opennlp/models/langdetect/1.8.3/README.txt
    * http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html
    * http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.langdetect.classifying
    * http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.cli.langdetect
    * Good precision etc on detecting Maori: https://www.apache.org/dist/opennlp/models/langdetect/1.8.3/langdetect-183.bin.report.txt
    * Maybe useful: http://opennlp.sourceforge.net/models-1.5/
        "Use the links in the table below to download the pre-trained models for the OpenNLP 1.5 series."



1. Download OpenNLP from https://opennlp.apache.org/download.html.
I got the binary version and unzipped it.


2. Get the model, langdetect-183.bin, from: http://opennlp.apache.org/models.html
    Direct download link: https://www.apache.org/dyn/closer.cgi/opennlp/models/langdetect/1.8.3/langdetect-183.bin
The page says:
    "All models are zip compressed (like a jar file), they must not be uncompressed."
- So if langdetect-183.bin is renamed to langdetect-183.zip, you can inspect its contents.
- But for use with openNLP, don't rename or unzip it.


3. UNNECESSARY:
I started by following the instructions at the bottom of:
https://www.apache.org/dist/opennlp/models/langdetect/1.8.3/README.txt
    Note that the svn checkout from the leipzig data link is too huge. So I restricted it to just Maori, English, Dutch and French text, see further below.

These instructions proved ultimately unnecessary, as the downloaded openNLP model for language detection, langdetect-183.bin, already contains all that.


Individual downloads of Leipzig corpora for just Maori or other specific languages can be found via:
    http://wortschatz.uni-leipzig.de/en/download/

[svn co https://svn.apache.org/repos/bigdata/opennlp/trunk/leipzig]
https://svn.apache.org/repos/bigdata/opennlp/trunk/leipzig/
Here's the svn web view for the leipzig/data language files: https://svn.apache.org/repos/bigdata/opennlp/trunk/leipzig/data/

[Here's the svn web view of apache.org: http://svn.eu.apache.org/viewvc
Unfortunately this covers other apache stuff, including openNLP, but not the leipzig language data files I was after]


    svn co --depth immediates --trust-server-cert --non-interactive https://svn.apache.org/repos/bigdata/opennlp/trunk opennlp-corpus
    cd opennlp-corpus
    svn up --set-depth immediates --trust-server-cert --non-interactive
    cd leipzig
    svn up --set-depth immediates --trust-server-cert --non-interactive
    cd resources/
    wharariki:[197]/Scratch/ak19/openNLP-lang-detect/opennlp-corpus/leipzig/resources>svn up --set-depth infinity --trust-server-cert --non-interactive
    cd ../data
    svn up --trust-server-cert --non-interactive mri_web_2011_100K-sentences.txt

    (# UNNECESSARY TO DOWNLOAD:
    svn up eng_wikipedia_2012_3M-sentences.txt
    svn up nld_mixed_2012_1M-sentences.txt
    svn up fra_mixed_2009_1M-sentences.txt
    )

    cd ..
    # in opennlp-corpus/leipzig
    chmod u+x create_langdetect_model.sh


    cd /Scratch/ak19/openNLP-lang-detect/apache-opennlp-1.9.1
    export OPENNLP_HOME=`pwd`
    ./create_langdetect_model.sh mri_web_2011_100K-sentences.txt
    ./create_langdetect_model.sh nld_mixed_2012_1M-sentences.txt


4. Attempting to run OpenNLP + LanguageDetectorModel's command line tools for language prediction

I couldn't get prediction to work from the command line, so I ended up writing that part as a Java class. But I could run some parts of the command line tools with the instructions on the following pages.


http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html
http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.langdetect.classifying
http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.cli.langdetect
    "Usage: opennlp LanguageDetector model < documents"

# must have exported OPENNLP_HOME (maybe add its bin to PATH?)
# Following the Usage instruction just above:
    $ cd /Scratch/ak19/openNLP-lang-detect
    $ ./apache-opennlp-1.9.1/bin/opennlp LanguageDetector langdetect-183.bin < opennlp-corpus-downloaded/mri-nz_web_2017_100K/mri-nz_web_2017_100K-words.txt

    [# Sending all output into a file for inspection:
    $ ./apache-opennlp-1.9.1/bin/opennlp LanguageDetector langdetect-183.bin < opennlp-corpus-downloaded/mri-nz_web_2017_100K/mri-nz_web_2017_100K-words.txt > bla.txt 2>&1
    ]



    [# Didn't work, even though the first opennlp manual link above said that the "opennlp LanguageDetector model" command should take input from stdin.
    # (The shell's "<" operator reads from a file, so the quoted sentence below is treated as a file name, not as input text.)

    ./apache-opennlp-1.9.1/bin/opennlp LanguageDetector langdetect-183.bin < "Ko tēnei te Whare Wānanga o Waikato e whakatau nei i ngā iwi o te ao, ki roto i te riu o te awa e rere nei, ki runga i te whenua e hora nei, ki raro i te taumaru o ngā maunga whakaruru e tau awhi nei."
    ]


UPDATE 24 July 2019: The following worked
    >$OPENNLP_HOME/bin/opennlp LanguageDetector $OPENNLP_HOME/models/langdetect-183.bin < opennlp-corpus/leipzig/data/mri_web_2011_100K-sentences.txt
but it doesn't provide predictions. I'm not sure I understand what it did, other than print the contents of the *sentences.txt file and end with:
    ...
    Average: 0.1 doc/s
    Total: 1 doc
    Runtime: 8.046s
    Execution time: 8.719 seconds


5. For writing Java code:
To write the basic code, I followed the Java skeleton examples at
    * http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html
    * http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.langdetect.classifying


To get the imports correct, I searched for example/tutorial Java code on using openNLP with the LanguageDetectorModel.
* Java code, including the imports:
https://www.tutorialkart.com/opennlp/language-detector-example-in-apache-opennlp/
* The tutorial link above also covers Java code for training detection of a particular language.


apache-opennlp-1.9.1/lib/opennlp-tools-1.9.1.jar contains the packages we want to import, in particular
    import opennlp.tools.langdetect.*;
    import opennlp.tools.util.*;

6. Wrote the very basic form of the MaoriDetector.java class.

To compile and run:

a. export OPENNLP_HOME=/Scratch/ak19/openNLP-lang-detect/apache-opennlp-1.9.1
b. wharariki:[115]/Scratch/ak19/openNLP-lang-detect/src>javac -cp ".:$OPENNLP_HOME/lib/*" MaoriDetector.java
c. wharariki:[116]/Scratch/ak19/openNLP-lang-detect/src>java -cp ".:$OPENNLP_HOME/lib/*" MaoriDetector


(Though possibly the only jar file needed from $OPENNLP_HOME/lib is opennlp-tools-1.9.1.jar)

================

GENERAL LINKS FOR LANGUAGE DETECTION

https://www.slideshare.net/shuyo/short-text-language-detection-with-infinitygram-12949447
http://alexott.blogspot.com/2017/10/evaluating-fasttexts-models-for.html

LINKS FOR SENTENCE DETECTION AND TRAINING A SENTENCE DETECTOR

http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.cli.sentdetect

SentenceDetectorTrainer example:
https://www.tutorialkart.com/opennlp/train-model-sentence-detection-java/
Command line version:
https://stackoverflow.com/questions/36516363/sentence-detection-with-opennlp