Last change on this file since 33868 was 33398, checked in by ak19, 5 years ago

Committing the actual package structure and the updated README after changing the package structure and instructions on compiling/running as there will be more Java classes.

-------------------------
BASIC README: QUICK SETUP
-------------------------
0. The code, its necessary helper files and libraries, and this README live at:
    http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection

You can check it out from svn with:
    svn co http://svn.greenstone.org/gs3-extensions/maori-lang-detection

This checkout contains:
- a tarball of the OpenNLP version that was current at the time of the commit, which you can also get from its original site http://opennlp.apache.org/download.html
- the folder "models-trainingdata-and-sampletxts", itself containing:
    - langdetect-183.bin: the LanguageDetectorModel to be used by OpenNLP (again, current at the time of commit), which you can also get from http://opennlp.apache.org/models.html (direct link: https://www.apache.org/dyn/closer.cgi/opennlp/models/langdetect/1.8.3/langdetect-183.bin)
    - mri-sent_trained.bin: our own custom-generated model for Maori Sentence Detection
    - mri-sent.train: the training text file for generating the mri-sent_trained.bin Maori Sentence Detection model
    - sample_mri_paragraphs.txt: contains some text to test the Maori Sentence Detection model on. Its content is from the Niupepa collection page http://www.nzdl.org/cgi-bin/library?gg=text&e=p-00000-00---off-0niupepa--00-0----0-10-0---0---0direct-10---4-------0-1l--11-en-50---20-about---00-0-1-00-0-0-11-1-0utfZz-8-00&a=p&p=about&l=mi&nw=utf-8
    - sample_maori_shorttext.txt: to test MaoriTextDetector.java with
- the "src" folder for Java classes, currently just MaoriTextDetector.java and its class file. MaoriTextDetector.java uses the aforementioned LanguageDetectorModel, langdetect-183.bin, to detect whether input text from a file or stdin is in Maori or not
- gen_SentenceDetection_model.sh, our custom script that generates both the mri-sent.train and model mri-sent_trained.bin files mentioned above
    - the script works on opennlp's leipzig corpus of 100k Maori sentences from 2011 to get its sample sentences into the correct format in the mri-sent.train file
    - from this file containing training sentences, it generates the Sentence Detector Model, mri-sent_trained.bin
- mri-opennlp-corpus.tar.gz: a tarball containing the 100k Maori sentences opennlp corpus checked out with svn in its original directory structure from https://svn.apache.org/repos/bigdata/opennlp/trunk/mri_web_2011_100K-sentences.txt


1. Once you've svn checked out maori-lang-detection from gs3-extensions, create a folder called "models" and put the files langdetect-183.bin and mri-sent_trained.bin from the folder "models-trainingdata-and-sampletxts" into it.
(These are just zip files, but they have to keep the .bin extension for OpenNLP to use them. If you ever wish to see the contents of such a .bin file, you can rename a copy to .zip and use the Archive Manager or another Zip tool to inspect it.)


2. Next, extract apache-opennlp-1.9.1-bin.tar.gz.
This will create a folder called apache-opennlp-1.9.1. Move the "models" folder you created in step 1 into this folder.
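
Steps 1 and 2 can be sketched as a small shell function. This is only an illustration, not something shipped in the checkout: the function name setup_opennlp_models is made up, and it assumes the OpenNLP tarball sits at the top level of your checkout and that you want it extracted there.

```shell
# Hypothetical helper (not part of the checkout): performs steps 1 and 2.
# Argument: path to your maori-lang-detection checkout.
# Assumption: apache-opennlp-1.9.1-bin.tar.gz is at the top level of the checkout.
setup_opennlp_models() {
    checkout="$1"
    data="$checkout/models-trainingdata-and-sampletxts"

    # Step 1: collect the two model files in a new "models" folder.
    mkdir -p "$checkout/models" || return 1
    cp "$data/langdetect-183.bin" "$data/mri-sent_trained.bin" "$checkout/models/" || return 1

    # Step 2: unpack OpenNLP inside the checkout, then move "models" into it.
    tar -xzf "$checkout/apache-opennlp-1.9.1-bin.tar.gz" -C "$checkout" || return 1
    mv "$checkout/models" "$checkout/apache-opennlp-1.9.1/" || return 1
}
```

You would call it as: setup_opennlp_models /path/to/maori-lang-detection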


3. Before you can compile or run the MaoriTextDetector program, you always have to prepare a terminal by setting up the environment for OpenNLP as follows:
    cd /type/here/path/to/your/extracted/apache-opennlp-1.9.1
    export OPENNLP_HOME=`pwd`
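
Because the later steps only work in a terminal where OPENNLP_HOME is set, a quick sanity check can save some confusion. The following is a minimal sketch; check_opennlp_home is a made-up name, and the only thing it tests is that the opennlp-tools-1.9.1.jar used in step 4 is where the classpath expects it.

```shell
# Hypothetical helper (not part of the checkout): verifies the environment from step 3.
check_opennlp_home() {
    if [ -z "${OPENNLP_HOME:-}" ]; then
        echo "OPENNLP_HOME is not set; redo step 3 in this terminal" >&2
        return 1
    fi
    # The compile step puts this jar on the classpath, so it must exist.
    if [ ! -f "$OPENNLP_HOME/lib/opennlp-tools-1.9.1.jar" ]; then
        echo "no lib/opennlp-tools-1.9.1.jar under $OPENNLP_HOME" >&2
        return 1
    fi
    echo "OPENNLP_HOME looks usable: $OPENNLP_HOME"
}
```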


4. If you want to recompile, go back up to the checked-out maori-lang-detection folder and into its "src" subfolder. To compile, make sure you have the JDK 7+ bin folder on your PATH environment variable.
Still in the SAME terminal as where you set up the OPENNLP_HOME environment in step 3, you can now run:
    maori-lang-detection/src$ javac -cp ".:$OPENNLP_HOME/lib/opennlp-tools-1.9.1.jar" org/greenstone/atea/MaoriTextDetector.java

5. To run the MaoriTextDetector program, you will need the JDK or JRE 7+ bin folder on your PATH. Still in the SAME terminal as where you set up the OPENNLP_HOME environment in step 3, and from the same "src" folder,
type one of the following:
    maori-lang-detection/src$ java -cp ".:$OPENNLP_HOME/lib/*" org.greenstone.atea.MaoriTextDetector --help
    (prints the usage, including other options)

    maori-lang-detection/src$ java -cp ".:$OPENNLP_HOME/lib/*" org.greenstone.atea.MaoriTextDetector --file <full/path/to/textfile>

    maori-lang-detection/src$ java -cp ".:$OPENNLP_HOME/lib/*" org.greenstone.atea.MaoriTextDetector -
    which expects text to stream in from standard input.
    If entering text manually, remember to press Ctrl-D to indicate the end of stdin as usual.
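
Since the "-" form reads standard input, text can also be piped in instead of typed. A minimal sketch, assuming you are in maori-lang-detection/src with OPENNLP_HOME exported as in step 3; detect_text is a made-up wrapper name, not something the checkout provides.

```shell
# Hypothetical wrapper (not part of the checkout).
# Run from maori-lang-detection/src in the terminal prepared in step 3.
detect_text() {
    # "-" makes MaoriTextDetector read standard input. The pipe closes stdin
    # when the text ends, so no Ctrl-D is needed.
    printf '%s\n' "$1" | java -cp ".:$OPENNLP_HOME/lib/*" org.greenstone.atea.MaoriTextDetector -
}
```

For example: detect_text "Ko tēnei te Whare Wānanga o Waikato"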


For links to background reading materials, see the OLD README section further below.


NOTE: The OpenNLP Language Detection Model can detect non-macronised Māori text too,
but as anticipated, the same text without macrons produces a lower confidence level for the language prediction. Compare the two runs below.
(These transcripts use the unqualified class name MaoriTextDetector, which appears to predate the move into the org.greenstone.atea package; with the current layout, run the program as shown in step 5.)

$maori-lang-detection/src>java -cp ".:$OPENNLP_HOME/lib/opennlp-tools-1.9.1.jar" MaoriTextDetector -
    Waiting to read text from STDIN... (press Ctrl-D when done entering text)>
    Ko tenei te Whare Wananga o Waikato e whakatau nei i nga iwi o te ao, ki roto i te riu o te awa e rere nei, ki runga i te whenua e hora nei, ki raro i te taumaru o nga maunga whakaruru e tau awhi nei.
    Best language: mri
    Best language confidence: 0.5959533972070814
    Exitting program with returnVal 0...

$maori-lang-detection/src>java -cp ".:$OPENNLP_HOME/lib/opennlp-tools-1.9.1.jar" MaoriTextDetector -
    Waiting to read text from STDIN... (press Ctrl-D when done entering text)>
    Ko tēnei te Whare Wānanga o Waikato e whakatau nei i ngā iwi o te ao, ki roto i te riu o te awa e rere nei, ki runga i te whenua e hora nei, ki raro i te taumaru o ngā maunga whakaruru e tau awhi nei.
    Best language: mri
    Best language confidence: 0.6825737450092515
    Exitting program with returnVal 0...
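
If you want to compare confidence values across several inputs, the number can be pulled out of the program's output. This is a rough sketch that assumes the output keeps the "Best language confidence: <number>" line format seen in the runs above; best_confidence is a made-up helper name.

```shell
# Hypothetical helper: reads detector output on stdin and prints the number
# from the "Best language confidence:" line.
# Assumption: the output format matches the transcripts above.
best_confidence() {
    awk -F': *' '/Best language confidence:/ { print $2 }'
}

# Feeding it the tail of the non-macronised run shown above
# prints 0.5959533972070814
best_confidence <<'EOF'
    Best language: mri
    Best language confidence: 0.5959533972070814
    Exitting program with returnVal 0...
EOF
```

With a live run, the detector's output would be piped into best_confidence in the same way.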


-------------------------
        OLD README
-------------------------


http://opennlp.apache.org/news/model-langdetect-183.html
Language Detector Model for Apache OpenNLP released

0. Can we make it run?
Can we detect it with Eng, Fr, NL docs?

1. openNLP - is Maori included?
If not, learn how to teach it to recognise Maori:
how to run the training part of their software and add in a Maori language training set.

ANSWER: Yes, Maori is included, see https://www.apache.org/dist/opennlp/models/langdetect/1.8.3/README.txt

2. Add in a small program to detect Maori in particular

3. Macron and non-macron input text language recognition support?

-----------------------------------------

OPENNLP LANGUAGE DETECTION READING:
    General:
    * https://stackoverflow.com/questions/7670427/how-does-language-detection-work
    * https://github.com/andreasjansson/language-detection.el
    * https://en.wikipedia.org/wiki/ISO_639-3

    Specific:
    * openNLP download: http://opennlp.apache.org/download.html
        "Models
        The models for Apache OpenNLP are found here.
        The models can be used for testing or getting started, please train your own models for all other use cases."

    * On LanguageDetectorModel for OpenNLP:
        http://opennlp.apache.org/news/model-langdetect-183.html
        https://www.apache.org/dist/opennlp/models/langdetect/1.8.3/README.txt
    * http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html
    * http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.langdetect.classifying
    * http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.cli.langdetect
    * Good precision etc on detecting Maori: https://www.apache.org/dist/opennlp/models/langdetect/1.8.3/langdetect-183.bin.report.txt
    * Maybe useful: http://opennlp.sourceforge.net/models-1.5/
        "Use the links in the table below to download the pre-trained models for the OpenNLP 1.5 series."



1. Download OpenNLP from https://opennlp.apache.org/download.html.
I got the binary version and unzipped it.


2. Get the model, langdetect-183.bin, from: http://opennlp.apache.org/models.html
    Direct download link: https://www.apache.org/dyn/closer.cgi/opennlp/models/langdetect/1.8.3/langdetect-183.bin
The page says:
    "All models are zip compressed (like a jar file), they must not be uncompressed."
- So when langdetect-183.bin is renamed to langdetect-183.zip, you can inspect the contents.
- But for use with openNLP, don't rename or unzip it.


3. UNNECESSARY:
I started by following the instructions at the bottom of:
https://www.apache.org/dist/opennlp/models/langdetect/1.8.3/README.txt
    Note that the svn checkout from the leipzig data link is too huge, so I restricted it to just Maori, English, Dutch and French text, see further below.

These instructions ultimately proved unnecessary, as the downloaded openNLP model for language detection, langdetect-183.bin, already contains all that.


Individual Leipzig corpora for just Maori or other specific languages can be downloaded via:
    http://wortschatz.uni-leipzig.de/en/download/

[svn co https://svn.apache.org/repos/bigdata/opennlp/trunk/leipzig]
https://svn.apache.org/repos/bigdata/opennlp/trunk/leipzig/
Here's the svn web view for the leipzig/data language files: https://svn.apache.org/repos/bigdata/opennlp/trunk/leipzig/data/

[Here's the svn web view of apache.org: http://svn.eu.apache.org/viewvc
Unfortunately this covers other apache projects, including openNLP, but not the leipzig language data files I was after]


    svn co --depth immediates --trust-server-cert --non-interactive https://svn.apache.org/repos/bigdata/opennlp/trunk opennlp-corpus
    cd opennlp-corpus
    svn up --set-depth immediates --trust-server-cert --non-interactive
    cd leipzig
    svn up --set-depth immediates --trust-server-cert --non-interactive
    cd resources/
    # in opennlp-corpus/leipzig/resources
    svn up --set-depth infinity --trust-server-cert --non-interactive
    cd ../data
    svn up --trust-server-cert --non-interactive mri_web_2011_100K-sentences.txt

    (# UNNECESSARY TO DOWNLOAD:
    svn up eng_wikipedia_2012_3M-sentences.txt
    svn up nld_mixed_2012_1M-sentences.txt
    svn up fra_mixed_2009_1M-sentences.txt
    )

    cd ..
    # in opennlp-corpus/leipzig
    chmod u+x create_langdetect_model.sh


    cd /Scratch/ak19/openNLP-lang-detect/apache-opennlp-1.9.1
    export OPENNLP_HOME=`pwd`
    # the script lives in opennlp-corpus/leipzig, so return there before running it
    cd /Scratch/ak19/openNLP-lang-detect/opennlp-corpus/leipzig
    ./create_langdetect_model.sh mri_web_2011_100K-sentences.txt
    ./create_langdetect_model.sh nld_mixed_2012_1M-sentences.txt



4. Attempting to run OpenNLP + LanguageDetectorModel's command line tools for language prediction

I couldn't get prediction to work from the command line, so I ended up writing that part as a Java class. But I could run some of the command line tools with the instructions on the following pages.


http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html
http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.langdetect.classifying
http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.cli.langdetect
    "Usage: opennlp LanguageDetector model < documents"

# must have exported OPENNLP_HOME (maybe add its bin to PATH?)
# Following the Usage instruction just above:
    $ cd /Scratch/ak19/openNLP-lang-detect
    $ ./apache-opennlp-1.9.1/bin/opennlp LanguageDetector langdetect-183.bin < opennlp-corpus-downloaded/mri-nz_web_2017_100K/mri-nz_web_2017_100K-words.txt

    [# Sending all output into a file for inspection:
    $ ./apache-opennlp-1.9.1/bin/opennlp LanguageDetector langdetect-183.bin < opennlp-corpus-downloaded/mri-nz_web_2017_100K/mri-nz_web_2017_100K-words.txt > bla.txt 2>&1
    ]



    [# Didn't work, even though the first opennlp manual link above said that the "opennlp LanguageDetector model" command should take input from stdin:

    ./apache-opennlp-1.9.1/bin/opennlp LanguageDetector langdetect-183.bin < "Ko tēnei te Whare Wānanga o Waikato e whakatau nei i ngā iwi o te ao, ki roto i te riu o te awa e rere nei, ki runga i te whenua e hora nei, ki raro i te taumaru o ngā maunga whakaruru e tau awhi nei."
    ]
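
A likely reason the last attempt failed is the shell rather than OpenNLP: "<" treats what follows as the name of a file to open, not as the text to send. The sketch below shows the difference using cat as a stand-in for the opennlp command, so it runs without OpenNLP installed; the shortened sentence is only sample input.

```shell
text="Ko tēnei te Whare Wānanga o Waikato"

# "<" opens a FILE named by its argument. No file has this sentence as its
# name, so the redirection fails and cat never gets any input.
{ cat < "$text"; } 2>/dev/null || echo "redirect failed: no such file"

# A pipe delivers the text itself on standard input instead.
printf '%s\n' "$text" | cat
```

Piping the sentence into the opennlp LanguageDetector command in the same way should at least get the text onto its standard input. Whether the tool then prints a usable prediction is not something these notes confirm.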


UPDATE 24 July 2019: The following worked
    >$OPENNLP_HOME/bin/opennlp LanguageDetector $OPENNLP_HOME/models/langdetect-183.bin < opennlp-corpus/leipzig/data/mri_web_2011_100K-sentences.txt
but it doesn't provide predictions. Not sure I understand what it did other than print the contents of the *sentences.txt file and end with:
    ...
    Average: 0.1 doc/s
    Total: 1 doc
    Runtime: 8.046s
    Execution time: 8.719 seconds


5. For writing Java code:
To write the basic code, I followed the Java skeleton examples at
    * http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html
    * http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.langdetect.classifying


To get the imports correct, I searched for example/tutorial Java code on using openNLP with the LanguageDetectorModel
* Java code: Import files:
https://www.tutorialkart.com/opennlp/language-detector-example-in-apache-opennlp/
* The tutorial link above also covers Java code for training detection of a particular language.


apache-opennlp-1.9.1/lib/opennlp-tools-1.9.1.jar contains the packages we want to import, in particular
    import opennlp.tools.langdetect.*;
    import opennlp.tools.util.*;

6. Wrote the very basic form of the MaoriDetector.java class.

To compile and run:

a. export OPENNLP_HOME=/Scratch/ak19/openNLP-lang-detect/apache-opennlp-1.9.1
b. wharariki:[115]/Scratch/ak19/openNLP-lang-detect/src>javac -cp ".:$OPENNLP_HOME/lib/*" MaoriDetector.java
c. wharariki:[116]/Scratch/ak19/openNLP-lang-detect/src>java -cp ".:$OPENNLP_HOME/lib/*" MaoriDetector


(Though possibly the only jar file needed in $OPENNLP_HOME/lib is opennlp-tools-1.9.1.jar)

-------------------------------------------------
READING: GENERAL LINKS FOR LANGUAGE DETECTION
-------------------------------------------------

https://www.slideshare.net/shuyo/short-text-language-detection-with-infinitygram-12949447
http://alexott.blogspot.com/2017/10/evaluating-fasttexts-models-for.html

--------------------------------------------------------------------------
READING: LINKS FOR SENTENCE DETECTION AND TRAINING A SENTENCE DETECTOR
--------------------------------------------------------------------------

* http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.cli.sentdetect
* SentenceDetectorTrainer java example:
https://www.tutorialkart.com/opennlp/train-model-sentence-detection-java/
* SentenceDetectorTrainer and SentenceDetector from commandline:
https://stackoverflow.com/questions/36516363/sentence-detection-with-opennlp
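
For orientation, here is a rough sketch of the kind of conversion gen_SentenceDetection_model.sh is described as doing in the BASIC README. It is NOT that script. It assumes each line of the Leipzig sentences file is a number, a TAB, then the sentence, and that the training file wants one sentence per line. The function name leipzig_to_train and the "-lang mri" value in the commented trainer call are illustrative guesses.

```shell
# Hypothetical helper, not the author's gen_SentenceDetection_model.sh.
# Assumption: each corpus line is "<number><TAB><sentence>".
leipzig_to_train() {
    # Drop the leading number column, then remove lines left empty.
    # (Lines without a TAB are passed through unchanged by cut.)
    cut -f2- | sed '/^[[:space:]]*$/d'
}

# Possible use, with the corpus file from the checkout's tarball:
#   leipzig_to_train < mri_web_2011_100K-sentences.txt > mri-sent.train
#   $OPENNLP_HOME/bin/opennlp SentenceDetectorTrainer -model mri-sent_trained.bin \
#       -lang mri -data mri-sent.train -encoding UTF-8
```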