Changeset 33355 for gs3-extensions/maori-lang-detection/README.txt
Timestamp: 2019-07-24T20:54:50+12:00
File: 1 edited
gs3-extensions/maori-lang-detection/README.txt
svn co http://svn.greenstone.org/gs3-extensions/maori-lang-detection

This checkout contains:
- a tarball of the OpenNLP version that was current at the time of the commit; you can also get it from its original site, http://opennlp.apache.org/download.html
- the folder "models-trainingdata-and-sampletxts", itself containing:
  - langdetect-183.bin: the LanguageDetectionModel to be used by OpenNLP (again, current at the time of commit); you can also get it from http://opennlp.apache.org/models.html (direct link: https://www.apache.org/dyn/closer.cgi/opennlp/models/langdetect/1.8.3/langdetect-183.bin)
  - mri-sent_trained.bin: our own custom-generated model for Maori sentence detection
  - mri-sent.train: the training text file used to generate the mri-sent_trained.bin Maori sentence detection model
  - sample_mri_paragraphs.txt: some text to test the Maori sentence detection model on (a usage sketch follows this list). Its content is from the Niupepa collection page http://www.nzdl.org/cgi-bin/library?gg=text&e=p-00000-00---off-0niupepa--00-0----0-10-0---0---0direct-10---4-------0-1l--11-en-50---20-about---00-0-1-00-0-0-11-1-0utfZz-8-00&a=p&p=about&l=mi&nw=utf-8
  - sample_maori_shorttext.txt: a short text to test MaoriTextDetector.java with
- the "src" folder for Java classes, currently just MaoriTextDetector.java and its class file. MaoriTextDetector.java uses the aforementioned LanguageDetectionModel, langdetect-183.bin, to detect whether input text from a file or stdin is in Maori or not
- gen_SentenceDetection_model.sh: our custom script that generates both the mri-sent.train and mri-sent_trained.bin files mentioned above
  - the script works on OpenNLP's Leipzig corpus of 100k Maori sentences from 2011 to get its sample sentences into the correct format in the mri-sent.train file
  - from this file of training sentences, it then generates the sentence detector model, mri-sent_trained.bin
- mri-opennlp-corpus.tar.gz: a tarball containing the 100k Maori sentences OpenNLP corpus, checked out with svn in its original directory structure from https://svn.apache.org/repos/bigdata/opennlp/trunk/mri_web_2011_100K-sentences.txt
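As a rough illustration (not part of the checkout), here is a minimal Java sketch of exercising the bundled Maori sentence detection model against the sample paragraphs, assuming the standard OpenNLP 1.9 sentdetect API and the checkout-relative paths above; gen_SentenceDetection_model.sh and MaoriTextDetector.java may do things differently:

    // SentenceDetectionSketch.java -- illustrative only; file paths are assumptions, adjust as needed
    import java.io.File;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;

    public class SentenceDetectionSketch {
        public static void main(String[] args) throws Exception {
            // load the custom Maori sentence detection model shipped with the checkout
            SentenceModel model = new SentenceModel(
                    new File("models-trainingdata-and-sampletxts/mri-sent_trained.bin"));
            SentenceDetectorME detector = new SentenceDetectorME(model);

            // split the sample paragraphs into sentences and print one sentence per line
            String text = new String(Files.readAllBytes(
                    Paths.get("models-trainingdata-and-sampletxts/sample_mri_paragraphs.txt")),
                    StandardCharsets.UTF_8);
            for (String sentence : detector.sentDetect(text)) {
                System.out.println(sentence);
            }
        }
    }

It compiles and runs the same way as MaoriTextDetector.java, i.e. with $OPENNLP_HOME/lib/opennlp-tools-1.9.1.jar on the classpath.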
1. Once you've svn checked out the maori-lang-detection extension from gs3-extensions, create a folder called "models" and put the files langdetect-183.bin and mri-sent_trained.bin from the folder "models-trainingdata-and-sampletxts" into it.
(These are just zip files, but they have to keep the .bin extension in order for OpenNLP to use them. If you ever wish to see the contents of such a .bin file, you can rename it to .zip and use the Archive Manager or another zip tool to inspect it.)

You can optionally put the file models-trainingdata-and-sampletxts/

2. Next extract the apache-opennlp-1.9.1-bin.tar.gz.
This will create a folder called apache-opennlp-1.9.1. Move the "models" folder you created in step 1 into this folder.

3. Before you can compile or run the MaoriTextDetector program, you always have to prepare a terminal by setting up the environment for OpenNLP as follows:
… …
Still in the SAME terminal as where you set up the OPENNLP_HOME environment in step 3, you can now run:
    maori-lang-detection/src$ javac -cp ".:$OPENNLP_HOME/lib/opennlp-tools-1.9.1.jar" MaoriTextDetector.java

5. To run the MaoriTextDetector program, you will need the JDK or JRE 7+ bin folder on your PATH and, still in the SAME terminal as where you set up the OPENNLP_HOME environment in step 3,
… …

I got the binary version. I unzipped it.

2. Get the model, langdetect-183.bin, from: http://opennlp.apache.org/models.html
Direct download link: https://www.apache.org/dyn/closer.cgi/opennlp/models/langdetect/1.8.3/langdetect-183.bin
… …
- But for using with OpenNLP, don't rename or unzip it.

3. UNNECESSARY:
I started by following the instructions at the bottom of:
… …

svn co --depth immediates --trust-server-cert --non-interactive https://svn.apache.org/repos/bigdata/opennlp/trunk opennlp-corpus
cd opennlp-corpus
svn up --set-depth immediates --trust-server-cert --non-interactive
cd leipzig
svn up --set-depth immediates --trust-server-cert --non-interactive
cd resources/
wharariki:[197]/Scratch/ak19/openNLP-lang-detect/opennlp-corpus/leipzig/resources>svn up --set-depth infinity --trust-server-cert --non-interactive
cd ../data
svn up --trust-server-cert --non-interactive mri_web_2011_100K-sentences.txt

(# UNNECESSARY TO DOWNLOAD:
svn up eng_wikipedia_2012_3M-sentences.txt
svn up nld_mixed_2012_1M-sentences.txt
svn up fra_mixed_2009_1M-sentences.txt
)

cd ..
… …

4. Attempting to run OpenNLP + LanguageDetectorModel's command line tools for language prediction

I couldn't get prediction to work from the command line, so I ended up writing that part as a Java class. But I could run some parts of the command line tools with the instructions on the following pages.
… …

UPDATE 24 July 2019: The following worked
    >$OPENNLP_HOME/bin/opennlp LanguageDetector $OPENNLP_HOME/models/langdetect-183.bin < opennlp-corpus/leipzig/data/mri_web_2011_100K-sentences.txt
but it doesn't provide predictions. I'm not sure I understand what it did other than print the contents of the *sentences.txt file and end with:
    ...
    Average: 0.1 doc/s
    Total: 1 doc
    Runtime: 8.046s
    Execution time: 8.719 seconds

5. For writing Java code:
To write the basic code, I followed the Java skeleton examples at
… …
(Though possibly the only jar file needed in $OPENNLP_HOME/lib is opennlp-tools-1.9.1.jar)
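As a rough illustration of the kind of language detection call this section refers to (a sketch only, assuming the standard OpenNLP 1.9 langdetect API; the actual MaoriTextDetector.java in the src folder may be structured quite differently):

    // LangDetectSketch.java -- illustrative only, not the real MaoriTextDetector.java
    import java.io.File;
    import opennlp.tools.langdetect.Language;
    import opennlp.tools.langdetect.LanguageDetectorME;
    import opennlp.tools.langdetect.LanguageDetectorModel;

    public class LangDetectSketch {
        public static void main(String[] args) throws Exception {
            // load the pre-trained language detection model (path is an assumption, adjust as needed)
            LanguageDetectorModel model = new LanguageDetectorModel(new File("models/langdetect-183.bin"));
            LanguageDetectorME detector = new LanguageDetectorME(model);

            // predict the most probable language of some input text;
            // "mri" is the ISO 639-3 code for Maori
            Language best = detector.predictLanguage("He tino pai tenei.");
            System.out.println(best.getLang() + " (confidence: " + best.getConfidence() + ")");
        }
    }

It compiles with the same javac command shown in the setup steps above, since it only needs opennlp-tools-1.9.1.jar on the classpath.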
================

UNORGANISED

https://www.slideshare.net/shuyo/short-text-language-detection-with-infinitygram-12949447
http://alexott.blogspot.com/2017/10/evaluating-fasttexts-models-for.html

http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.cli.sentdetect

SentenceDetectorTrainer example (a rough Java sketch appears below):
https://www.tutorialkart.com/opennlp/train-model-sentence-detection-java/

https://stackoverflow.com/questions/36516363/sentence-detection-with-opennlp
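Along the lines of the SentenceDetectorTrainer tutorial linked above, here is a rough Java sketch of how a model like mri-sent_trained.bin can be trained (assumptions: mri-sent.train holds one sentence per line in UTF-8, and the standard OpenNLP 1.9 training API; gen_SentenceDetection_model.sh may instead drive the opennlp command line tool):

    // TrainMriSentModelSketch.java -- illustrative only; paths and training settings are assumptions
    import java.io.BufferedOutputStream;
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import java.nio.charset.StandardCharsets;
    import opennlp.tools.sentdetect.SentenceDetectorFactory;
    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;
    import opennlp.tools.sentdetect.SentenceSample;
    import opennlp.tools.sentdetect.SentenceSampleStream;
    import opennlp.tools.util.MarkableFileInputStreamFactory;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;
    import opennlp.tools.util.TrainingParameters;

    public class TrainMriSentModelSketch {
        public static void main(String[] args) throws Exception {
            // read the training sentences, one sentence per line
            ObjectStream<String> lines = new PlainTextByLineStream(
                    new MarkableFileInputStreamFactory(
                            new File("models-trainingdata-and-sampletxts/mri-sent.train")),
                    StandardCharsets.UTF_8);

            try (ObjectStream<SentenceSample> samples = new SentenceSampleStream(lines)) {
                // train a sentence detection model for Maori ("mri") with default parameters
                SentenceModel model = SentenceDetectorME.train(
                        "mri", samples,
                        new SentenceDetectorFactory("mri", true, null, null),
                        TrainingParameters.defaultParams());

                // serialise the trained model to a .bin file
                try (OutputStream out = new BufferedOutputStream(
                        new FileOutputStream("mri-sent_trained.bin"))) {
                    model.serialize(out);
                }
            }
        }
    }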