------------------ BASIC README ------------------

0. The code, its necessary helper files and libraries, and this README live at:
       http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection
   You can check it out from svn with:
       svn co http://svn.greenstone.org/gs3-extensions/maori-lang-detection

   - It contains the OpenNLP release that was current at the time of the commit,
     but you can also get it from its original site:
       http://opennlp.apache.org/download.html
   - The LanguageDetectionModel to be used by OpenNLP is also included (again,
     current at the time of commit), but you can get it from
       http://opennlp.apache.org/models.html
     (direct link:
       https://www.apache.org/dyn/closer.cgi/opennlp/models/langdetect/1.8.3/langdetect-183.bin)

1. Once you've checked it out from svn, create a folder called "models" and put
   the langdetect-183.bin file into it.
   (This is just a zip file, but it has to keep the .bin extension for OpenNLP
   to use it. If you ever wish to see its contents, rename a copy to .zip and
   inspect it with the Archive Manager or another zip tool.)

2. Next, extract apache-opennlp-1.9.1-bin.tar.gz. This creates a folder called
   apache-opennlp-1.9.1. Move the "models" folder you created in step 1 into
   this folder.

3. Before you can compile or run the MaoriTextDetector program, you always have
   to prepare the terminal by setting up the environment for OpenNLP as follows:

       cd /type/here/path/to/your/extracted/apache-opennlp-1.9.1
       export OPENNLP_HOME=`pwd`

4. If you want to recompile, go into the checked-out maori-lang-detection
   folder's "src" subfolder. To compile, make sure you have the JDK 7+ bin
   folder on your PATH environment variable. Still in the SAME terminal where
   you set up the OPENNLP_HOME environment in step 3, you can now run:

       maori-lang-detection/src$ javac -cp ".:$OPENNLP_HOME/lib/opennlp-tools-1.9.1.jar" MaoriTextDetector.java

5.
   To run the MaoriTextDetector program, you will need the JDK or JRE 7+ bin
   folder on your PATH. Still in the SAME terminal where you set up the
   OPENNLP_HOME environment in step 3, type one of the following:

       maori-lang-detection/src$ java -cp ".:$OPENNLP_HOME/lib/*" MaoriTextDetector --help
           (prints the usage, including other options)
       maori-lang-detection/src$ java -cp ".:$OPENNLP_HOME/lib/*" MaoriTextDetector --file <full/path/to/textfile>
       maori-lang-detection/src$ java -cp ".:$OPENNLP_HOME/lib/*" MaoriTextDetector -
           which expects text to stream in from standard input. If entering
           text manually, remember to press Ctrl-D to signal the usual end of
           stdin.

For reading materials, see the OLD README section below.

------------------------- OLD README -------------------------

http://opennlp.apache.org/news/model-langdetect-183.html
    Language Detector Model for Apache OpenNLP released

0. Can we make it run? Can we test detection with English (Eng), French (Fr)
   and Dutch (NL) docs?

1. OpenNLP - is Maori included? If not, learn how to teach it to recognise
   Maori: how to run the training part of their software and add in a Maori
   language training set.
   ANSWER: Yes, Maori is included, see
       https://www.apache.org/dist/opennlp/models/langdetect/1.8.3/README.txt

2. Add in a small program to detect Maori in particular.

3. Macron and non-macron input text language recognition support?

-----------------------------------------
READING:

General:
* https://stackoverflow.com/questions/7670427/how-does-language-detection-work
* https://github.com/andreasjansson/language-detection.el
* https://en.wikipedia.org/wiki/ISO_639-3

Specific:
* OpenNLP download: http://opennlp.apache.org/download.html
  "Models
   The models for Apache OpenNLP are found here. The models can be used for
   testing or getting started, please train your own models for all other use
   cases."
* On the LanguageDetectionModel for OpenNLP:
    http://opennlp.apache.org/news/model-langdetect-183.html
    https://www.apache.org/dist/opennlp/models/langdetect/1.8.3/README.txt
* http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html
* http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.langdetect.classifying
* http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.cli.langdetect
* Good precision etc. on detecting Maori:
    https://www.apache.org/dist/opennlp/models/langdetect/1.8.3/langdetect-183.bin.report.txt
* Maybe useful: http://opennlp.sourceforge.net/models-1.5/
  "Use the links in the table below to download the pre-trained models for the
   OpenNLP 1.5 series."

1. Download OpenNLP from https://opennlp.apache.org/download.html.
   I got the binary version and unzipped it.

2. Get the model, langdetect-183.bin, from http://opennlp.apache.org/models.html
   Direct download link:
       https://www.apache.org/dyn/closer.cgi/opennlp/models/langdetect/1.8.3/langdetect-183.bin
   The page says: "All models are zip compressed (like a jar file), they must
   not be uncompressed."
   - So when langdetect-183.bin is renamed to langdetect-183.zip, you can
     inspect its contents.
   - But for using it with OpenNLP, don't rename or unzip it.

3. UNNECESSARY: I started by following the instructions at the bottom of:
       https://www.apache.org/dist/opennlp/models/langdetect/1.8.3/README.txt
   Note that the svn checkout from the Leipzig data link is too huge, so I
   restricted it to just Maori, English, Dutch and French text; see further
   below.
   These instructions proved ultimately unnecessary, as the downloaded OpenNLP
   model for language detection, langdetect-183.bin, already contains all that.

   Individual Leipzig corpora for just Maori or other specific languages can be
   downloaded via:
       http://wortschatz.uni-leipzig.de/en/download/

   [svn co https://svn.apache.org/repos/bigdata/opennlp/trunk/leipzig]
       https://svn.apache.org/repos/bigdata/opennlp/trunk/leipzig/
   Here's the svn web view for the leipzig/data language files:
       https://svn.apache.org/repos/bigdata/opennlp/trunk/leipzig/data/
   [Here's the svn web view of apache.org: http://svn.eu.apache.org/viewvc
    Unfortunately this is other apache stuff, including OpenNLP, but not the
    Leipzig language data files I was after.]

       svn co https://svn.apache.org/repos/bigdata/opennlp/trunk --depth immediates
       mv trunk opennlp-corpus
       svn up --set-depth immediates
       cd resources/
       wharariki:[197]/Scratch/ak19/openNLP-lang-detect/opennlp-corpus/leipzig/resources>svn up --set-depth infinity
       cd ../data
       svn up mri_web_2011_100K-sentences.txt
       svn up eng_wikipedia_2012_3M-sentences.txt
       svn up nld_mixed_2012_1M-sentences.txt
       svn up fra_mixed_2009_1M-sentences.txt
       cd ..    # in opennlp-corpus/leipzig
       chmod u+x create_langdetect_model.sh
       cd /Scratch/ak19/openNLP-lang-detect/apache-opennlp-1.9.1
       export OPENNLP_HOME=`pwd`
       ./create_langdetect_model.sh mri_web_2011_100K-sentences.txt
       ./create_langdetect_model.sh nld_mixed_2012_1M-sentences.txt

4. Attempting to run OpenNLP + LanguageDetectorModel's command line tools for
   language prediction.
   I couldn't get prediction to work from the command line, so I ended up
   writing that part as a Java class. But I could run some of the command line
   tools with the instructions on the following pages:
       http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html
       http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.langdetect.classifying
       http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.cli.langdetect

   "Usage: opennlp LanguageDetector model < documents"

       # must have exported OPENNLP_HOME (maybe add its bin to PATH?)
       # Following the Usage instruction just above:
       $ cd /Scratch/ak19/openNLP-lang-detect
       $ ./apache-opennlp-1.9.1/bin/opennlp LanguageDetector langdetect-183.bin < opennlp-corpus-downloaded/mri-nz_web_2017_100K/mri-nz_web_2017_100K-words.txt

   [# Sending all output into a file for inspection:
       $ ./apache-opennlp-1.9.1/bin/opennlp LanguageDetector langdetect-183.bin < opennlp-corpus-downloaded/mri-nz_web_2017_100K/mri-nz_web_2017_100K-words.txt > bla.txt 2>&1 ]

   [# The following didn't work, even though the first opennlp manual link
    # above said that the "opennlp LangDetector model" command should take
    # input from stdin. (The shell's < operator redirects from a file, so it
    # treats the quoted sentence as a filename; piping it in with echo would
    # presumably be the way to feed a literal sentence on stdin.)
       ./apache-opennlp-1.9.1/bin/opennlp LanguageDetector langdetect-183.bin < "Ko tēnei te Whare Wānanga o Waikato e whakatau nei i ngā iwi o te ao, ki roto i te riu o te awa e rere nei, ki runga i te whenua e hora nei, ki raro i te taumaru o ngā maunga whakaruru e tau awhi nei." ]

5. For writing Java code:
   To write the basic code, I followed the Java skeleton examples at:
       * http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html
       * http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.langdetect.classifying
   To get the imports correct, I searched for example/tutorial Java code on
   using OpenNLP with the LanguageDetectorModel:
       * Java code with the import statements:
         https://www.tutorialkart.com/opennlp/language-detector-example-in-apache-opennlp/
       * The tutorial link above also covers Java code to train detection of a
         particular language.
   apache-opennlp-1.9.1/lib/opennlp-tools-1.9.1.jar contains the imports we
   want, in particular:
       import opennlp.tools.langdetect.*;
       import opennlp.tools.util.*;

6. Wrote the very basic form of the MaoriDetector.java class.
   To compile and run:
   a. export OPENNLP_HOME=/Scratch/ak19/openNLP-lang-detect/apache-opennlp-1.9.1
   b. wharariki:[115]/Scratch/ak19/openNLP-lang-detect/src>javac -cp ".:$OPENNLP_HOME/lib/*" MaoriDetector.java
   c. wharariki:[116]/Scratch/ak19/openNLP-lang-detect/src>java -cp ".:$OPENNLP_HOME/lib/*" MaoriDetector
   (Though possibly the only jar file needed in $OPENNLP_HOME/lib is
   opennlp-tools-1.9.1.jar.)
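The MaoriDetector class itself is not reproduced in this README. As a rough
sketch only, based on the langdetect skeleton in the OpenNLP 1.9.1 manual
referenced in step 5, the core of such a class might look like the following.
(The class name MaoriDetectorSketch, the model path under OPENNLP_HOME/models,
and the sample sentence are illustrative assumptions, not the committed code.)

```java
import java.io.File;
import java.io.IOException;

import opennlp.tools.langdetect.Language;
import opennlp.tools.langdetect.LanguageDetector;
import opennlp.tools.langdetect.LanguageDetectorME;
import opennlp.tools.langdetect.LanguageDetectorModel;

public class MaoriDetectorSketch {
    public static void main(String[] args) throws IOException {
        // Locate the model under the "models" folder created in step 1 of the
        // BASIC README, relative to OPENNLP_HOME (set in step 3).
        String home = System.getenv("OPENNLP_HOME");
        if (home == null) {
            System.err.println("Set OPENNLP_HOME first (see step 3 of the BASIC README).");
            System.exit(1);
        }
        File modelFile = new File(home, "models/langdetect-183.bin");

        // Load the pre-trained model and wrap it in the maxent-based detector.
        LanguageDetectorModel model = new LanguageDetectorModel(modelFile);
        LanguageDetector detector = new LanguageDetectorME(model);

        // Text to classify: first command-line argument, or a sample sentence.
        String text = (args.length > 0) ? args[0]
                : "Ko tēnei te Whare Wānanga o Waikato e whakatau nei i ngā iwi o te ao.";

        // predictLanguage() returns the single most probable language.
        // langdetect-183.bin uses ISO 639-3 codes, so Maori is "mri".
        Language best = detector.predictLanguage(text);
        System.out.println("Best language: " + best.getLang()
                + " (confidence " + best.getConfidence() + ")");
        System.out.println("Is Maori: " + "mri".equals(best.getLang()));
    }
}
```

Compiling and running this sketch would use the same classpath as step 6 above,
e.g. javac -cp ".:$OPENNLP_HOME/lib/*" MaoriDetectorSketch.java followed by
java -cp ".:$OPENNLP_HOME/lib/*" MaoriDetectorSketch.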