------------------------- BASIC README: QUICK SETUP -------------------------

0. The code, its necessary helper files and libraries, and this README live at:
   http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection

   You can check it out with svn:
   svn co http://svn.greenstone.org/gs3-extensions/maori-lang-detection

   This checkout contains:
   - a tarball of the OpenNLP version that was current at the time of the commit;
     you can also get it from its original site,
     http://opennlp.apache.org/download.html
   - the folder "models-trainingdata-and-sampletxts", itself containing:
     - langdetect-183.bin: the LanguageDetectionModel to be used by OpenNLP
       (again, current at the time of commit). You can also get it from
       http://opennlp.apache.org/models.html (direct link:
       https://www.apache.org/dyn/closer.cgi/opennlp/models/langdetect/1.8.3/langdetect-183.bin)
     - mri-sent_trained.bin: our own custom-generated model for Maori sentence
       detection
     - mri-sent.train: the training text file for generating the
       mri-sent_trained.bin Maori sentence detection model
     - sample_mri_paragraphs.txt: contains some text to test the Maori sentence
       detection model on. Its content is from the Niupepa collection page
       http://www.nzdl.org/cgi-bin/library?gg=text&e=p-00000-00---off-0niupepa--00-0----0-10-0---0---0direct-10---4-------0-1l--11-en-50---20-about---00-0-1-00-0-0-11-1-0utfZz-8-00&a=p&p=about&l=mi&nw=utf-8
     - sample_maori_shorttext.txt: to test MaoriTextDetector.java with
   - the "src" folder for Java classes, currently just MaoriTextDetector.java and
     its classfile.
     MaoriTextDetector.java uses the aforementioned LanguageDetectionModel,
     langdetect-183.bin, to detect whether input text from a file or stdin is in
     Maori or not.
   - gen_SentenceDetection_model.sh: our custom script that generates both the
     mri-sent.train and mri-sent_trained.bin files mentioned above.
     - The script works on OpenNLP's Leipzig corpus of 100k Maori sentences from
       2011 to get its sample sentences into the correct format in the
       mri-sent.train file.
     - From this file of training sentences, it then generates the sentence
       detector model, mri-sent_trained.bin.
   - mri-opennlp-corpus.tar.gz: a tarball containing the 100k Maori sentences
     OpenNLP corpus, checked out with svn in its original directory structure from
     https://svn.apache.org/repos/bigdata/opennlp/trunk/mri_web_2011_100K-sentences.txt

1. Once you've checked out maori-lang-detection from gs3-extensions with svn,
   create a folder called "models" and put the files langdetect-183.bin and
   mri-sent_trained.bin from the folder "models-trainingdata-and-sampletxts"
   into it.

   (These are just zip files, but they have to keep the .bin extension in order
   for OpenNLP to use them. If you ever wish to see the contents of such a .bin
   file, you can rename it to .zip and use the Archive Manager or another zip
   tool to inspect the contents.)

   You can optionally put the file models-trainingdata-and-sampletxts/

2. Next, extract apache-opennlp-1.9.1-bin.tar.gz. This will create a folder
   called apache-opennlp-1.9.1. Move the "models" folder you created in step 1
   into this folder.

3. Before you can compile or run the MaoriTextDetector program, you always have
   to prepare a terminal by setting up the environment for OpenNLP as follows:

   cd /type/here/path/to/your/extracted/apache-opennlp-1.9.1
   export OPENNLP_HOME=`pwd`

4. If you want to recompile, go into the checked-out maori-lang-detection
   folder's "src" subfolder. To compile, make sure you have the JDK 7+ bin
   folder on your PATH environment variable.
   Still in the SAME terminal where you set up the OPENNLP_HOME environment in
   step 3, you can now run:

   maori-lang-detection/src$ javac -cp ".:$OPENNLP_HOME/lib/opennlp-tools-1.9.1.jar" MaoriTextDetector.java

5. To run the MaoriTextDetector program, you will need the JDK or JRE 7+ bin
   folder on your PATH. Still in the SAME terminal where you set up the
   OPENNLP_HOME environment in step 3, type one of the following:

   maori-lang-detection/src$ java -cp ".:$OPENNLP_HOME/lib/*" MaoriTextDetector --help
     (prints the usage, including other options)
   maori-lang-detection/src$ java -cp ".:$OPENNLP_HOME/lib/*" MaoriTextDetector --file <full/path/to/textfile>
   maori-lang-detection/src$ java -cp ".:$OPENNLP_HOME/lib/*" MaoriTextDetector -
     which expects text to stream in from standard input. If entering text
     manually, remember to press Ctrl-D to indicate the usual end of stdin.

For links to background reading materials, see the OLD README section further
below.

NOTE: The OpenNLP language detection model can detect non-macronised Māori text
too, but, as anticipated, the same text produces a lower confidence level for
the language prediction. Compare:

$maori-lang-detection/src>java -cp ".:$OPENNLP_HOME/lib/opennlp-tools-1.9.1.jar" MaoriTextDetector -
Waiting to read text from STDIN... (press Ctrl-D when done entering text)>
Ko tenei te Whare Wananga o Waikato e whakatau nei i nga iwi o te ao, ki roto i
te riu o te awa e rere nei, ki runga i te whenua e hora nei, ki raro i te
taumaru o nga maunga whakaruru e tau awhi nei.
Best language: mri
Best language confidence: 0.5959533972070814
Exitting program with returnVal 0...

$maori-lang-detection/src>java -cp ".:$OPENNLP_HOME/lib/opennlp-tools-1.9.1.jar" MaoriTextDetector -
Waiting to read text from STDIN...
(press Ctrl-D when done entering text)>
Ko tēnei te Whare Wānanga o Waikato e whakatau nei i ngā iwi o te ao, ki roto i
te riu o te awa e rere nei, ki runga i te whenua e hora nei, ki raro i te
taumaru o ngā maunga whakaruru e tau awhi nei.
Best language: mri
Best language confidence: 0.6825737450092515
Exitting program with returnVal 0...

------------------------- OLD README -------------------------

http://opennlp.apache.org/news/model-langdetect-183.html
Language Detector Model for Apache OpenNLP released

0. Can we make it run? Can we detect Maori when tested with Eng, Fr, NL docs?

1. openNLP - is Maori included? If not, learn how to teach it to recognise
   Maori: how to run the training part of their software and add in a Maori
   language training set.
   ANSWER: Yes, Maori is included, see
   https://www.apache.org/dist/opennlp/models/langdetect/1.8.3/README.txt

2. Add in a small program to detect Maori in particular.

3. Macron and non-macron input text language recognition support?

-----------------------------------------
OPENNLP LANGUAGE DETECTION READING
-----------------------------------------

General:
* https://stackoverflow.com/questions/7670427/how-does-language-detection-work
* https://github.com/andreasjansson/language-detection.el
* https://en.wikipedia.org/wiki/ISO_639-3

Specific:
* openNLP download: http://opennlp.apache.org/download.html
  "Models
  The models for Apache OpenNLP are found here. The models can be used for
  testing or getting started, please train your own models for all other use
  cases."
* On the LanguageDetectionModel for OpenNLP:
  http://opennlp.apache.org/news/model-langdetect-183.html
  https://www.apache.org/dist/opennlp/models/langdetect/1.8.3/README.txt
* http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html
* http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.langdetect.classifying
* http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.cli.langdetect
* Good precision etc. on detecting Maori:
  https://www.apache.org/dist/opennlp/models/langdetect/1.8.3/langdetect-183.bin.report.txt
* Maybe useful: http://opennlp.sourceforge.net/models-1.5/
  "Use the links in the table below to download the pre-trained models for the
  OpenNLP 1.5 series."

1. Download OpenNLP from https://opennlp.apache.org/download.html. I got the
   binary version and unzipped it.

2. Get the model, langdetect-183.bin, from http://opennlp.apache.org/models.html
   Direct download link:
   https://www.apache.org/dyn/closer.cgi/opennlp/models/langdetect/1.8.3/langdetect-183.bin

   The page says: "All models are zip compressed (like a jar file), they must
   not be uncompressed."
   - So when langdetect-183.bin is renamed to langdetect-183.zip, you can
     inspect its contents.
   - But for using it with openNLP, don't rename or unzip it.

3. UNNECESSARY: I started by following the instructions at the bottom of
   https://www.apache.org/dist/opennlp/models/langdetect/1.8.3/README.txt

   Note that the svn checkout from the leipzig data link is too huge, so I
   restricted it to just Maori, English, Dutch and French text; see further
   below.
   These instructions proved ultimately unnecessary, as the downloaded openNLP
   model for language detection, langdetect-183.bin, already contains all that.

   Individual downloads of Leipzig corpora, for just Maori or other specific
   languages, can be found via http://wortschatz.uni-leipzig.de/en/download/

   [svn co https://svn.apache.org/repos/bigdata/opennlp/trunk/leipzig]
   https://svn.apache.org/repos/bigdata/opennlp/trunk/leipzig/

   Here's the svn web view for the leipzig/data language files:
   https://svn.apache.org/repos/bigdata/opennlp/trunk/leipzig/data/

   [Here's the svn web view of apache.org: http://svn.eu.apache.org/viewvc
   Unfortunately this is other apache stuff, including openNLP, but not the
   leipzig language data files I was after.]

   svn co --depth immediates --trust-server-cert --non-interactive https://svn.apache.org/repos/bigdata/opennlp/trunk opennlp-corpus
   cd opennlp-corpus
   svn up --set-depth immediates --trust-server-cert --non-interactive
   cd leipzig
   svn up --set-depth immediates --trust-server-cert --non-interactive
   cd resources/
   svn up --set-depth infinity --trust-server-cert --non-interactive
   cd ../data
   svn up --trust-server-cert --non-interactive mri_web_2011_100K-sentences.txt

   (# UNNECESSARY TO DOWNLOAD:
   svn up eng_wikipedia_2012_3M-sentences.txt
   svn up nld_mixed_2012_1M-sentences.txt
   svn up fra_mixed_2009_1M-sentences.txt
   )

   cd ..    # back in opennlp-corpus/leipzig
   chmod u+x create_langdetect_model.sh
   cd /Scratch/ak19/openNLP-lang-detect/apache-opennlp-1.9.1
   export OPENNLP_HOME=`pwd`
   ./create_langdetect_model.sh mri_web_2011_100K-sentences.txt
   ./create_langdetect_model.sh nld_mixed_2012_1M-sentences.txt

4. Attempting to run OpenNLP + LanguageDetectorModel's command line tools for
   language prediction.

   I couldn't get prediction to work from the command line, so I ended up
   writing that part as a Java class.
   But I could run some parts of the command line tools with the instructions
   on the following pages:
   http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html
   http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.langdetect.classifying
   http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.cli.langdetect

   "Usage: opennlp LanguageDetector model < documents"

   # must have exported OPENNLP_HOME (maybe add its bin to PATH?)
   # Following the Usage instruction just above:
   $ cd /Scratch/ak19/openNLP-lang-detect
   $ ./apache-opennlp-1.9.1/bin/opennlp LanguageDetector langdetect-183.bin < opennlp-corpus-downloaded/mri-nz_web_2017_100K/mri-nz_web_2017_100K-words.txt

   [# Sending all output into a file for inspection:
   $ ./apache-opennlp-1.9.1/bin/opennlp LanguageDetector langdetect-183.bin < opennlp-corpus-downloaded/mri-nz_web_2017_100K/mri-nz_web_2017_100K-words.txt > bla.txt 2>&1 ]

   [# Didn't work, even though the first opennlp manual link above said that the
   "opennlp LangDetector model" command should take input from stdin:
   ./apache-opennlp-1.9.1/bin/opennlp LanguageDetector langdetect-183.bin < "Ko tēnei te Whare Wānanga o Waikato e whakatau nei i ngā iwi o te ao, ki roto i te riu o te awa e rere nei, ki runga i te whenua e hora nei, ki raro i te taumaru o ngā maunga whakaruru e tau awhi nei." ]

   UPDATE 24 July 2019: The following worked:
   >$OPENNLP_HOME/bin/opennlp LanguageDetector $OPENNLP_HOME/models/langdetect-183.bin < opennlp-corpus/leipzig/data/mri_web_2011_100K-sentences.txt
   but it doesn't provide predictions. I'm not sure what it did, other than
   print the contents of the *sentences.txt file and end with:
   ...
   Average: 0.1 doc/s
   Total: 1 doc
   Runtime: 8.046s
   Execution time: 8.719 seconds

5.
   For writing Java code:

   To write the basic code, I followed the Java skeleton examples at
   * http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html
   * http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.langdetect.classifying

   To get the imports correct, I searched for example/tutorial Java code on
   using openNLP with the LanguageDetectorModel:
   * Java code, import files:
     https://www.tutorialkart.com/opennlp/language-detector-example-in-apache-opennlp/
   * The tutorial link above also covers Java code to train detecting a
     particular language.

   apache-opennlp-1.9.1/lib/opennlp-tools-1.9.1.jar contains the imports we
   want, in particular:
   import opennlp.tools.langdetect.*;
   import opennlp.tools.util.*;

6. Wrote the very basic form of the MaoriDetector.java class. To compile and
   run:
   a. export OPENNLP_HOME=/Scratch/ak19/openNLP-lang-detect/apache-opennlp-1.9.1
   b. wharariki:[115]/Scratch/ak19/openNLP-lang-detect/src>javac -cp ".:$OPENNLP_HOME/lib/*" MaoriDetector.java
   c. wharariki:[116]/Scratch/ak19/openNLP-lang-detect/src>java -cp ".:$OPENNLP_HOME/lib/*" MaoriDetector
   (Though possibly the only jar file needed in $OPENNLP_HOME/lib is
   opennlp-tools-1.9.1.jar.)

-------------------------------------------------
READING: GENERAL LINKS FOR LANGUAGE DETECTION
-------------------------------------------------
https://www.slideshare.net/shuyo/short-text-language-detection-with-infinitygram-12949447
http://alexott.blogspot.com/2017/10/evaluating-fasttexts-models-for.html

--------------------------------------------------------------------------
READING: LINKS FOR SENTENCE DETECTION AND TRAINING A SENTENCE DETECTOR
--------------------------------------------------------------------------
* http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.cli.sentdetect
* SentenceDetectorTrainer Java example:
  https://www.tutorialkart.com/opennlp/train-model-sentence-detection-java/
* SentenceDetectorTrainer and SentenceDetector from the command line:
https://stackoverflow.com/questions/36516363/sentence-detection-with-opennlp
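
To make steps 5 and 6 of the OLD README concrete, here is a minimal sketch of
the kind of language-detection code described there, using the
opennlp.tools.langdetect imports listed above. It assumes
opennlp-tools-1.9.1.jar is on the classpath, OPENNLP_HOME is exported as in the
quick setup, and langdetect-183.bin sits in $OPENNLP_HOME/models; the class
name LangDetectSketch is hypothetical (the repository's real class is
MaoriTextDetector.java), so this is a sketch, not the repository code.

```java
import java.io.File;
import opennlp.tools.langdetect.Language;
import opennlp.tools.langdetect.LanguageDetector;
import opennlp.tools.langdetect.LanguageDetectorME;
import opennlp.tools.langdetect.LanguageDetectorModel;

public class LangDetectSketch {
    public static void main(String[] args) throws Exception {
        // Load the pretrained model; the models/ location follows step 1 of
        // the quick setup above.
        File modelFile =
            new File(System.getenv("OPENNLP_HOME"), "models/langdetect-183.bin");
        LanguageDetectorModel model = new LanguageDetectorModel(modelFile);
        LanguageDetector detector = new LanguageDetectorME(model);

        // predictLanguage() returns the single best guess with a confidence
        // score, matching the "Best language" lines in the transcripts above.
        String text = "Ko tenei te Whare Wananga o Waikato e whakatau nei "
            + "i nga iwi o te ao.";
        Language best = detector.predictLanguage(text);
        System.out.println("Best language: " + best.getLang());
        System.out.println("Best language confidence: " + best.getConfidence());
    }
}
```

As noted in step 6, only opennlp-tools-1.9.1.jar should be needed on the
classpath for this to compile and run.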
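
The sentence-detection training described earlier relies on
gen_SentenceDetection_model.sh reformatting the Leipzig corpus into
mri-sent.train. The sketch below illustrates that reformatting step in Java:
it assumes (this is an assumption, not something the README states) that the
Leipzig sentence file uses a "line-number<TAB>sentence" layout, while the
OpenNLP SentenceDetectorTrainer wants one bare sentence per line. The class
and method names are hypothetical helpers for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper mirroring what gen_SentenceDetection_model.sh does to
// mri_web_2011_100K-sentences.txt before training mri-sent_trained.bin.
public class LeipzigToTrain {
    // Strip the leading "line-number<TAB>" from each Leipzig corpus line,
    // keeping only the sentence text; malformed lines are skipped.
    public static List<String> stripLineNumbers(List<String> leipzigLines) {
        List<String> sentences = new ArrayList<>();
        for (String line : leipzigLines) {
            int tab = line.indexOf('\t');
            if (tab >= 0 && tab + 1 < line.length()) {
                sentences.add(line.substring(tab + 1).trim());
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<String> demo = List.of(
            "1\tKo tenei te Whare Wananga o Waikato.",
            "2\tE rere nei te awa.");
        // One sentence per line is the format expected in mri-sent.train.
        for (String sentence : stripLineNumbers(demo)) {
            System.out.println(sentence);
        }
    }
}
```

The resulting one-sentence-per-line file is what you would then feed to
OpenNLP's SentenceDetectorTrainer (see the links in the section above) to
regenerate mri-sent_trained.bin.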