1 | http://opennlp.apache.org/news/model-langdetect-183.html
|
---|
2 | Language Detector Model for Apache OpenNLP released
|
---|
3 |
|
---|
4 | 0. Can we make it run
|
---|
5 | Can we detect it with Eng, Fr, NL docs
|
---|
6 |
|
---|
7 | 1. openNLP - is Maori included,
|
---|
8 | if not learn how to teach it to recognise Maori
|
---|
9 | how you run the training bit of their software and add in Maori language training set
|
---|
10 |
|
---|
11 | ANSWER: Yes, Maori is included, see https://www.apache.org/dist/opennlp/models/langdetect/1.8.3/README.txt
|
---|
12 |
|
---|
13 | 2. Add in a small program to detect particularly Maori
|
---|
14 |
|
---|
15 | 3. Macron and non-macron input text language recognition support?
|
---|
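One way to probe question 3 would be to run the detector on the same text with and without macrons and compare the predictions. The sketch below covers only the macron-stripping half and uses nothing beyond the JDK. The class name MacronStripper is hypothetical and not part of OpenNLP or of these notes.

```java
import java.text.Normalizer;

public class MacronStripper {

    /**
     * Returns the text with macrons removed (e.g. "ā" becomes "a").
     * NFD decomposition splits each macron vowel into base letter + U+0304
     * (combining macron); only U+0304 is dropped, so other diacritics survive.
     */
    public static String stripMacrons(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        String stripped = decomposed.replace("\u0304", "");
        return Normalizer.normalize(stripped, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // \u0113 is e-macron and \u0101 is a-macron; escapes avoid source-encoding issues.
        String sample = "Ko t\u0113nei te Whare W\u0101nanga o Waikato";
        System.out.println(stripMacrons(sample)); // prints "Ko tenei te Whare Wananga o Waikato"
    }
}
```

Both the original and the stripped string could then be passed to the language detector to see whether the predicted language or its confidence changes.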
16 |
|
---|
17 | -----------------------------------------
|
---|
18 |
|
---|
19 | READING:
|
---|
20 | General:
|
---|
21 | * https://stackoverflow.com/questions/7670427/how-does-language-detection-work
|
---|
22 | * https://github.com/andreasjansson/language-detection.el
|
---|
23 | * https://en.wikipedia.org/wiki/ISO_639-3
|
---|
24 |
|
---|
25 | Specific:
|
---|
26 | * openNLP download: http://opennlp.apache.org/download.html
|
---|
27 | "Models
|
---|
28 | The models for Apache OpenNLP are found here.
|
---|
29 | The models can be used for testing or getting started, please train your own models for all other use cases."
|
---|
30 |
|
---|
31 | * On LanguageDetectionModel for OpenNLP:
|
---|
32 | http://opennlp.apache.org/news/model-langdetect-183.html
|
---|
33 | https://www.apache.org/dist/opennlp/models/langdetect/1.8.3/README.txt
|
---|
34 | * http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html
|
---|
35 | * http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.langdetect.classifying
|
---|
36 | * http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.cli.langdetect
|
---|
37 | * Good precision etc on detecting Maori: https://www.apache.org/dist/opennlp/models/langdetect/1.8.3/langdetect-183.bin.report.txt
|
---|
38 | * Maybe useful: http://opennlp.sourceforge.net/models-1.5/
|
---|
39 | "Use the links in the table below to download the pre-trained models for the OpenNLP 1.5 series."
|
---|
40 |
|
---|
41 |
|
---|
42 |
|
---|
43 | 1. Download OpenNLP from https://opennlp.apache.org/download.html.
|
---|
44 | I got the binary version. I unzipped it.
|
---|
45 |
|
---|
46 | 2. Get the model, langdetect-183.bin, from: http://opennlp.apache.org/models.html
|
---|
47 | Direct download link: https://www.apache.org/dyn/closer.cgi/opennlp/models/langdetect/1.8.3/langdetect-183.bin
|
---|
48 | The page says:
|
---|
49 | "All models are zip compressed (like a jar file), they must not be uncompressed."
|
---|
50 | - So if a copy of langdetect-183.bin is renamed to langdetect-183.zip, its contents can be inspected.
|
---|
51 | - But for using with openNLP, don't rename or unzip it.
|
---|
52 |
|
---|
53 | 3. UNNECESSARY:
|
---|
54 | I started by following the instructions at the bottom of:
|
---|
55 | https://www.apache.org/dist/opennlp/models/langdetect/1.8.3/README.txt
|
---|
56 | Note that the svn checkout from the leipzig data link is too huge. So I restricted it to just Maori, English, Dutch and French text, see further below.
|
---|
57 |
|
---|
58 | These instructions proved ultimately unnecessary, as the downloaded openNLP model for language detection, langdetect-183.bin, already contains all that
|
---|
59 |
|
---|
60 |
|
---|
61 | Individual downloading of Leipzig corpora for just Maori or specific languages can be found via:
|
---|
62 | http://wortschatz.uni-leipzig.de/en/download/
|
---|
63 |
|
---|
64 | [svn co https://svn.apache.org/repos/bigdata/opennlp/trunk/leipzig]
|
---|
65 | https://svn.apache.org/repos/bigdata/opennlp/trunk/leipzig/
|
---|
66 | Here's the svn web view for the leipzig/data language files: https://svn.apache.org/repos/bigdata/opennlp/trunk/leipzig/data/
|
---|
67 |
|
---|
68 | [Here's the svn web view of apache.org: http://svn.eu.apache.org/viewvc
|
---|
69 | Unfortunately this is other apache stuff, including openNLP, but not the leipzig language data files I was after]
|
---|
70 |
|
---|
71 |
|
---|
72 | svn co https://svn.apache.org/repos/bigdata/opennlp/trunk --depth immediates
|
---|
73 | mv trunk opennlp-corpus
|
---|
74 | svn up --set-depth immediates
|
---|
75 | cd resources/
|
---|
76 | wharariki:[197]/Scratch/ak19/openNLP-lang-detect/opennlp-corpus/leipzig/resources>svn up --set-depth infinity
|
---|
77 | cd ../data
|
---|
78 | svn up mri_web_2011_100K-sentences.txt
|
---|
79 | svn up eng_wikipedia_2012_3M-sentences.txt
|
---|
80 | svn up nld_mixed_2012_1M-sentences.txt
|
---|
81 | svn up fra_mixed_2009_1M-sentences.txt
|
---|
82 |
|
---|
83 | cd ..
|
---|
84 | # in opennlp-corpus/leipzig
|
---|
85 | chmod u+x create_langdetect_model.sh
|
---|
86 |
|
---|
87 |
|
---|
88 | cd /Scratch/ak19/openNLP-lang-detect/apache-opennlp-1.9.1
|
---|
89 | export OPENNLP_HOME=`pwd`
|
---|
90 | ./create_langdetect_model.sh mri_web_2011_100K-sentences.txt
|
---|
91 | ./create_langdetect_model.sh nld_mixed_2012_1M-sentences.txt
|
---|
92 |
|
---|
93 |
|
---|
94 | 4. Attempting to run OpenNLP + LanguageDetectorModel's command line tools for language prediction
|
---|
95 |
|
---|
96 | I couldn't get prediction to work from the command line, so I ended up writing that part as a Java class. But I could run some part of the command line tools with the instructions on the following pages.
|
---|
97 |
|
---|
98 |
|
---|
99 | http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html
|
---|
100 | http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.langdetect.classifying
|
---|
101 | http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.cli.langdetect
|
---|
102 | "Usage: opennlp LanguageDetector model < documents"
|
---|
103 |
|
---|
104 | # must have exported OPENNLP_HOME (maybe add its bin to PATH?)
|
---|
105 | # Following the Usage instruction just above:
|
---|
106 | $ cd /Scratch/ak19/openNLP-lang-detect
|
---|
107 | $ ./apache-opennlp-1.9.1/bin/opennlp LanguageDetector langdetect-183.bin < opennlp-corpus-downloaded/mri-nz_web_2017_100K/mri-nz_web_2017_100K-words.txt
|
---|
108 |
|
---|
109 | [# Sending all output into a file for inspection:
|
---|
110 | $ ./apache-opennlp-1.9.1/bin/opennlp LanguageDetector langdetect-183.bin < opennlp-corpus-downloaded/mri-nz_web_2017_100K/mri-nz_web_2017_100K-words.txt > bla.txt 2>&1
|
---|
111 | ]
|
---|
112 |
|
---|
113 |
|
---|
114 |
|
---|
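115 | [# Didn't work, even though the first opennlp manual link above said that the "opennlp LanguageDetector model" command should take input from stdin. Probable cause: the shell's "<" redirects from a file, so the quoted sentence below was treated as a file name; piping the sentence in with echo "..." | instead should put it on stdin (untested here):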
|
---|
116 |
|
---|
117 | ./apache-opennlp-1.9.1/bin/opennlp LanguageDetector langdetect-183.bin < "Ko tēnei te Whare Wānanga o Waikato e whakatau nei i ngā iwi o te ao, ki roto i te riu o te awa e rere nei, ki runga i te whenua e hora nei, ki raro i te taumaru o ngā maunga whakaruru e tau awhi nei."
|
---|
118 | ]
|
---|
119 |
|
---|
120 |
|
---|
121 | 5. For writing Java code:
|
---|
122 | To write the basic code, I followed the Java skeleton examples at
|
---|
123 | * http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html
|
---|
124 | * http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.langdetect.classifying
|
---|
125 |
|
---|
126 |
|
---|
127 | To get the imports correct, I searched for example/tutorial java code on using openNLP with the LanguageDetectorModel
|
---|
128 | * Java code: Import files:
|
---|
129 | https://www.tutorialkart.com/opennlp/language-detector-example-in-apache-opennlp/
|
---|
130 | * The tutorial link above also covers Java code to train detecting a particular language.
|
---|
131 |
|
---|
132 |
|
---|
133 | apache-opennlp-1.9.1/lib/opennlp-tools-1.9.1.jar contains the imports we want, in particular
|
---|
134 | import opennlp.tools.langdetect.*;
|
---|
135 | import opennlp.tools.util.*;
|
---|
136 |
|
---|
137 | 6. Wrote a very basic form of the MaoriDetector.java class.
|
---|
138 |
|
---|
139 | To compile and run:
|
---|
140 |
|
---|
141 | a. export OPENNLP_HOME=/Scratch/ak19/openNLP-lang-detect/apache-opennlp-1.9.1
|
---|
142 | b. wharariki:[115]/Scratch/ak19/openNLP-lang-detect/src>javac -cp ".:$OPENNLP_HOME/lib/*" MaoriDetector.java
|
---|
143 | c. wharariki:[116]/Scratch/ak19/openNLP-lang-detect/src>java -cp ".:$OPENNLP_HOME/lib/*" MaoriDetector
|
---|
144 |
|
---|
145 |
|
---|
146 | (Though possibly the only jar file needed in $OPENNLP_HOME/lib is opennlp-tools-1.9.1.jar)
|
---|
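For illustration only, a class along the lines of step 6 might look like the sketch below. It is not the actual MaoriDetector.java. The class name MaoriDetectorSketch, the 0.5 confidence threshold and the relative model path are assumptions made for the example. It uses the opennlp.tools.langdetect classes from opennlp-tools-1.9.1.jar and assumes the model labels Maori with the ISO 639-3 code "mri".

```java
import java.io.File;
import java.io.IOException;

import opennlp.tools.langdetect.Language;
import opennlp.tools.langdetect.LanguageDetector;
import opennlp.tools.langdetect.LanguageDetectorME;
import opennlp.tools.langdetect.LanguageDetectorModel;

public class MaoriDetectorSketch {

    // Assumption: the model uses ISO 639-3 codes, so Maori is "mri".
    private static final String MAORI_CODE = "mri";
    // Assumption: an arbitrary cut-off; tune it against real documents.
    private static final double MIN_CONFIDENCE = 0.5;

    private final LanguageDetector detector;

    public MaoriDetectorSketch(File modelFile) throws IOException {
        // The .bin file is loaded as-is; it must not be unzipped or renamed.
        LanguageDetectorModel model = new LanguageDetectorModel(modelFile);
        this.detector = new LanguageDetectorME(model);
    }

    /** Returns the single most likely language for the text. */
    public Language bestLanguage(String text) {
        return detector.predictLanguage(text);
    }

    /** True if the top prediction is Maori with at least MIN_CONFIDENCE. */
    public boolean isMaori(String text) {
        Language best = bestLanguage(text);
        return MAORI_CODE.equals(best.getLang())
                && best.getConfidence() >= MIN_CONFIDENCE;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the model sits in the current working directory.
        MaoriDetectorSketch sketch =
                new MaoriDetectorSketch(new File("langdetect-183.bin"));

        // \u0113 is e-macron and \u0101 is a-macron.
        String text = "Ko t\u0113nei te Whare W\u0101nanga o Waikato e whakatau nei "
                + "i ng\u0101 iwi o te ao, ki roto i te riu o te awa e rere nei";

        Language best = sketch.bestLanguage(text);
        System.out.println("Best language: " + best.getLang()
                + " (confidence " + best.getConfidence() + ")");
        System.out.println("Is Maori: " + sketch.isMaori(text));
    }
}
```

It would be compiled and run the same way as above, with ".:$OPENNLP_HOME/lib/*" on the classpath. Confidence values for short inputs may be low because the model spreads probability over many languages, so the threshold is a starting point rather than a recommendation.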