source: gs3-extensions/maori-lang-detection/README.txt@ 33339

Last change on this file since 33339 was 33339, checked in by ak19, 5 years ago

Updated README.

------------------
BASIC README
------------------
0. The code and its necessary helper files and libraries, and this README, live at:
    http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection

You can checkout from svn with:
    svn co http://svn.greenstone.org/gs3-extensions/maori-lang-detection

- It contains the OpenNLP that was current at the time of the commit, but you can also get it from its original site http://opennlp.apache.org/download.html
- The LanguageDetectionModel to be used by OpenNLP is also included (again, current at the time of commit), but you can get it from http://opennlp.apache.org/models.html (direct link: https://www.apache.org/dyn/closer.cgi/opennlp/models/langdetect/1.8.3/langdetect-183.bin)

1. Once you have checked it out from svn, create a folder called "models" and put the langdetect-183.bin file into it.
(This is just a zip file, but it has to keep the .bin extension in order for OpenNLP to use it. If you ever wish to see its contents, you can rename a copy to .zip and use the Archive Manager or another Zip tool to inspect it.)
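
To confirm that a downloaded model really is a zip archive without renaming it, the first bytes of the file can be checked. The following is a minimal sketch, not part of the checkout: is_zip_archive is a hypothetical helper name, and it only tests for the usual signature at the start of a non-empty zip file (the bytes "PK" followed by 0x03 0x04).

```shell
# Sketch: succeed if the file starts with the zip signature "PK\003\004".
# is_zip_archive is a hypothetical helper name, not something shipped
# with OpenNLP or with this checkout.
is_zip_archive() {
    # Read the first four bytes and turn them into a hex string,
    # which is "504b0304" for a non-empty zip archive.
    sig=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
    [ "$sig" = "504b0304" ]
}
```

For example, "is_zip_archive models/langdetect-183.bin && echo zip" should print "zip" for an intact download. If the unzip tool is installed, "unzip -l models/langdetect-183.bin" should also list the contents without any renaming.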

2. Next extract the apache-opennlp-1.9.1-bin.tar.gz.
This will create a folder called apache-opennlp-1.9.1. Move the "models" folder you created in step 1 into this folder.

3. Before you can compile or run the MaoriTextDetector program, you always have to prepare a terminal by setting up the environment for OpenNLP as follows:
    cd /type/here/path/to/your/extracted/apache-opennlp-1.9.1
    export OPENNLP_HOME=`pwd`
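
Since everything that follows depends on OPENNLP_HOME, it may help to check the variable before compiling or running. The sketch below assumes the folder layout described in steps 1 to 3 (the opennlp-tools-1.9.1.jar under lib, the model under models). check_opennlp_home is a hypothetical helper name and is not part of the checkout.

```shell
# Sketch: confirm that OPENNLP_HOME looks like the folder prepared in steps 1-3.
# check_opennlp_home is a hypothetical helper name; the two paths it tests
# are assumptions based on the layout described in this README.
check_opennlp_home() {
    if [ -z "${OPENNLP_HOME:-}" ]; then
        echo "OPENNLP_HOME is not set" >&2
        return 1
    fi
    # The jar that the javac command in this README puts on the classpath.
    if [ ! -f "$OPENNLP_HOME/lib/opennlp-tools-1.9.1.jar" ]; then
        echo "missing $OPENNLP_HOME/lib/opennlp-tools-1.9.1.jar" >&2
        return 1
    fi
    # The model file placed in the "models" folder in steps 1 and 2.
    if [ ! -f "$OPENNLP_HOME/models/langdetect-183.bin" ]; then
        echo "missing $OPENNLP_HOME/models/langdetect-183.bin" >&2
        return 1
    fi
    echo "OPENNLP_HOME looks usable: $OPENNLP_HOME"
}
```

Running check_opennlp_home in the same terminal right after the export should print the confirmation line. A "missing" message points at the step that needs repeating.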


4. If you want to recompile, go up into the checked out maori-lang-detection folder's "src" subfolder. To compile, make sure you have the JDK7+ bin folder on your PATH environment variable.
Still in the SAME terminal as where you set up the OPENNLP_HOME environment in step 3, you can now run:
    maori-lang-detection/src$ javac -cp ".:$OPENNLP_HOME/lib/opennlp-tools-1.9.1.jar" MaoriTextDetector.java

5. To run the MaoriTextDetector program, you will need the JDK or JRE 7+ bin folder on your PATH.
Still in the SAME terminal as where you set up the OPENNLP_HOME environment in step 3, type one of the following:
    maori-lang-detection/src$ java -cp ".:$OPENNLP_HOME/lib/*" MaoriTextDetector --help
    (prints the usage, including other options)

    maori-lang-detection/src$ java -cp ".:$OPENNLP_HOME/lib/*" MaoriTextDetector --file <full/path/to/textfile>

    maori-lang-detection/src$ java -cp ".:$OPENNLP_HOME/lib/*" MaoriTextDetector -
    which expects text to stream in from standard input.
    If entering text manually, remember to press Ctrl-D to signal the end of standard input.
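
To check several text files in one go, the --file form above can be wrapped in a loop. This is a minimal sketch rather than anything shipped with the program: run_detector and detect_all are hypothetical helper names, the functions need to be called from the "src" folder with OPENNLP_HOME set as in step 3, and what the program prints for each file is not shown here.

```shell
# Sketch: run the detector once per .txt file in a directory.
# run_detector and detect_all are hypothetical helper names.

# Wraps the java command line given in step 5; extra arguments are passed through.
run_detector() {
    java -cp ".:$OPENNLP_HOME/lib/*" MaoriTextDetector "$@"
}

# Calls run_detector --file for every .txt file directly inside the given directory.
detect_all() {
    for f in "$1"/*.txt; do
        [ -e "$f" ] || continue    # the pattern matched nothing
        echo "== $f"
        run_detector --file "$f"
    done
}
```

Since step 5 asks for a full path to the text file, the directory is best given as a full path as well, for example: detect_all /full/path/to/textfiles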


For reading materials, see the OLD README section below.

-------------------------
OLD README
-------------------------


http://opennlp.apache.org/news/model-langdetect-183.html
Language Detector Model for Apache OpenNLP released

0. Can we make it run
Can we detect it with Eng, Fr, NL docs

1. openNLP - is Maori included,
if not learn how to teach it to recognise Maori
how you run the training bit of their software and add in Maori language training set

ANSWER: Yes, Maori is included, see https://www.apache.org/dist/opennlp/models/langdetect/1.8.3/README.txt

2. Add in a small program to detect particularly Maori

3. Macron and non-macron input text language recognition support?

-----------------------------------------

READING:
    General:
        * https://stackoverflow.com/questions/7670427/how-does-language-detection-work
        * https://github.com/andreasjansson/language-detection.el
        * https://en.wikipedia.org/wiki/ISO_639-3

    Specific:
        * openNLP download: http://opennlp.apache.org/download.html
            "Models
            The models for Apache OpenNLP are found here.
            The models can be used for testing or getting started, please train your own models for all other use cases."

        * On LanguageDetectionModel for OpenNLP:
            http://opennlp.apache.org/news/model-langdetect-183.html
            https://www.apache.org/dist/opennlp/models/langdetect/1.8.3/README.txt
        * http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html
        * http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.langdetect.classifying
        * http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.cli.langdetect
        * Good precision etc on detecting Maori: https://www.apache.org/dist/opennlp/models/langdetect/1.8.3/langdetect-183.bin.report.txt
        * Maybe useful: http://opennlp.sourceforge.net/models-1.5/
            "Use the links in the table below to download the pre-trained models for the OpenNLP 1.5 series."



1. Download OpenNLP from https://opennlp.apache.org/download.html.
I got the binary version. I unzipped it.

2. Get the model, langdetect-183.bin, from: http://opennlp.apache.org/models.html
    Direct download link: https://www.apache.org/dyn/closer.cgi/opennlp/models/langdetect/1.8.3/langdetect-183.bin
The page says:
    "All models are zip compressed (like a jar file), they must not be uncompressed."
- So when langdetect-183.bin is renamed to langdetect-183.zip, its contents can be inspected.
- But for using with openNLP, don't rename or unzip it.

3. UNNECESSARY:
I started by following the instructions at the bottom of:
https://www.apache.org/dist/opennlp/models/langdetect/1.8.3/README.txt
    Note that the svn checkout from the leipzig data link is too huge. So I restricted it to just Maori, English, Dutch and French text, see further below.

These instructions proved ultimately unnecessary, as the downloaded openNLP model for language detection, langdetect-183.bin, already contains all of that (Maori is included, as noted above).


Individual downloading of Leipzig corpora for just Maori or specific languages can be found via:
    http://wortschatz.uni-leipzig.de/en/download/

[svn co https://svn.apache.org/repos/bigdata/opennlp/trunk/leipzig]
https://svn.apache.org/repos/bigdata/opennlp/trunk/leipzig/
Here's the svn web view for the leipzig/data language files: https://svn.apache.org/repos/bigdata/opennlp/trunk/leipzig/data/

[Here's the svn web view of apache.org: http://svn.eu.apache.org/viewvc
Unfortunately this is other apache stuff, including openNLP, but not the leipzig language data files I was after]


    svn co https://svn.apache.org/repos/bigdata/opennlp/trunk --depth immediates
    mv trunk opennlp-corpus
    svn up --set-depth immediates
    cd resources/
    wharariki:[197]/Scratch/ak19/openNLP-lang-detect/opennlp-corpus/leipzig/resources>svn up --set-depth infinity
    cd ../data
    svn up mri_web_2011_100K-sentences.txt
    svn up eng_wikipedia_2012_3M-sentences.txt
    svn up nld_mixed_2012_1M-sentences.txt
    svn up fra_mixed_2009_1M-sentences.txt

    cd ..
    # in opennlp-corpus/leipzig
    chmod u+x create_langdetect_model.sh


    cd /Scratch/ak19/openNLP-lang-detect/apache-opennlp-1.9.1
    export OPENNLP_HOME=`pwd`
    ./create_langdetect_model.sh mri_web_2011_100K-sentences.txt
    ./create_langdetect_model.sh nld_mixed_2012_1M-sentences.txt


4. Attempting to run the command line tools of OpenNLP + LanguageDetectorModel for language prediction

I couldn't get prediction to work from the command line, so I ended up writing that part as a Java class. But I could run some part of the command line tools with the instructions on the following pages.


http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html
http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.langdetect.classifying
http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.cli.langdetect
    "Usage: opennlp LanguageDetector model < documents"

# must have exported OPENNLP_HOME (maybe add its bin to PATH?)
# Following the Usage instruction just above:
    $ cd /Scratch/ak19/openNLP-lang-detect
    $ ./apache-opennlp-1.9.1/bin/opennlp LanguageDetector langdetect-183.bin < opennlp-corpus-downloaded/mri-nz_web_2017_100K/mri-nz_web_2017_100K-words.txt

    [# Sending all output into a file for inspection:
    $ ./apache-opennlp-1.9.1/bin/opennlp LanguageDetector langdetect-183.bin < opennlp-corpus-downloaded/mri-nz_web_2017_100K/mri-nz_web_2017_100K-words.txt > bla.txt 2>&1
    ]



    [# Didn't work, even though the first opennlp manual link above said that the "opennlp LanguageDetector model" command should take input from stdin.
    # (The shell's "<" redirects from a file of that name, so the quoted sentence below is treated as a filename rather than as input text.)

    ./apache-opennlp-1.9.1/bin/opennlp LanguageDetector langdetect-183.bin < "Ko tēnei te Whare Wānanga o Waikato e whakatau nei i ngā iwi o te ao, ki roto i te riu o te awa e rere nei, ki runga i te whenua e hora nei, ki raro i te taumaru o ngā maunga whakaruru e tau awhi nei."
    ]
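
A literal sentence can be sent to a command's standard input with a pipe instead of "<". The sketch below only illustrates that shell technique. feed_text is a hypothetical helper name, and whether the opennlp command line tool then prints a usable prediction was not verified in these notes.

```shell
# Sketch: send a literal string to the standard input of a command.
# feed_text is a hypothetical helper name; the first argument is the text,
# everything after it is the command (with its arguments) to run.
feed_text() {
    text="$1"
    shift
    # printf writes the text plus a newline into the pipe,
    # and the command reads it from stdin.
    printf '%s\n' "$text" | "$@"
}
```

For example: feed_text "Ko tēnei te Whare Wānanga o Waikato" ./apache-opennlp-1.9.1/bin/opennlp LanguageDetector langdetect-183.bin
In bash, a here-string (the command followed by <<< "text") has the same effect.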


5. For writing Java code:
To write the basic code, I followed the Java skeleton examples at
    * http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html
    * http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.langdetect.classifying


To get the imports correct, I searched for example/tutorial java code on using openNLP with the LanguageDetectorModel
* Java code: Import files:
https://www.tutorialkart.com/opennlp/language-detector-example-in-apache-opennlp/
* The tutorial link above also covers Java code to train detecting a particular language.


apache-opennlp-1.9.1/lib/opennlp-tools-1.9.1.jar contains the imports we want, in particular
    import opennlp.tools.langdetect.*;
    import opennlp.tools.util.*;

6. Wrote the very basic form of the MaoriDetector.java class.

To compile and run:

a. export OPENNLP_HOME=/Scratch/ak19/openNLP-lang-detect/apache-opennlp-1.9.1
b. wharariki:[115]/Scratch/ak19/openNLP-lang-detect/src>javac -cp ".:$OPENNLP_HOME/lib/*" MaoriDetector.java
c. wharariki:[116]/Scratch/ak19/openNLP-lang-detect/src>java -cp ".:$OPENNLP_HOME/lib/*" MaoriDetector


(Though possibly the only jar file needed in $OPENNLP_HOME/lib is opennlp-tools-1.9.1.jar)