Timestamp:
2019-07-24T20:54:50+12:00
Author:
ak19
Message:

Changes for adding the new gen_SentenceDetection_model.sh script, which automates generating a Sentence Detector model for the Maori language, mri-sent_trained.bin. The model is trained on the mri-sent.train file, which is generated by appropriately formatting the 100k Maori sentences file from the 2011 opennlp corpus.

File:
1 edited

  • gs3-extensions/maori-lang-detection/README.txt

    r33350 r33355  
    88    svn co http://svn.greenstone.org/gs3-extensions/maori-lang-detection
    99
    10 - It contains the OpenNLP that was current at the time of the commit, but you can also get it from its original site http://opennlp.apache.org/download.html
    11 - The LanguageDetectionModel to be used by OpenNLP is also included (again, current at the time of commit), but you can get it from http://opennlp.apache.org/models.html (direct link: https://www.apache.org/dyn/closer.cgi/opennlp/models/langdetect/1.8.3/langdetect-183.bin)
    12 
    13 1. Once you've svn checked it out, create a folder called "models" and put the langdetect-183.bin file into it.
    14 (This is just a zip file, but has to remain with the .bin extension in order for OpenNLP to use it. If you ever wish to see its contents, you can rename to .zip and use the Archive Manager or other Zip tool to inspect the contents.)
     10This checkout contains:
     11- tarball of the OpenNLP version that was current at the time of the commit, but you can also get it from its original site http://opennlp.apache.org/download.html
     12- the folder "models-trainingdata-and-sampletxts", itself containing:
     13    - langdetect-183.bin: The LanguageDetectionModel to be used by OpenNLP is also included (again, current at the time of commit), but you can get it from http://opennlp.apache.org/models.html (direct link: https://www.apache.org/dyn/closer.cgi/opennlp/models/langdetect/1.8.3/langdetect-183.bin)
     14    - mri-sent_trained.bin: our own custom generated model for Maori Sentence Detection
     15    - mri-sent.train: the training text file for generating the mri-sent_trained.bin Maori Sentence Detection model
     16    - sample_mri_paragraphs.txt: contains some text to test the Maori Sentence Detection model on. Its content is from Niupepa collection page http://www.nzdl.org/cgi-bin/library?gg=text&e=p-00000-00---off-0niupepa--00-0----0-10-0---0---0direct-10---4-------0-1l--11-en-50---20-about---00-0-1-00-0-0-11-1-0utfZz-8-00&a=p&p=about&l=mi&nw=utf-8)
     17    - sample_maori_shorttext.txt: to test the MaoriTextDetector.java with
     18- "src" folder for Java classes, currently just MaoriTextDetector.java and its classfile. MaoriTextDetector.java uses the aforementioned LanguageDetectionModel, langdetect-183.bin, to detect whether input text from a file or stdin is in Maori or not
     19- gen_SentenceDetection_model.sh, our custom script that generates both the mri-sent.train and model mri-sent_trained.bin files mentioned above
     20    - the script works on opennlp's leipzig corpus of 100k Maori sentences from 2011 to get its sample sentences into the correct format in the mri-sent.train file
     21    - from this file containing training sentences, it generates the Sentence Detector Model, mri-sent_trained.bin
     22- mri-opennlp-corpus.tar.gz: a tarball containing the 100k Maori sentences opennlp corpus checked out with svn in its original directory structure from https://svn.apache.org/repos/bigdata/opennlp/trunk/mri_web_2011_100K-sentences.txt
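The two steps attributed to gen_SentenceDetection_model.sh above might be sketched roughly as follows. This is not the script itself. It assumes each line of the Leipzig sentences file is a numeric ID, a tab, then the sentence, and that the trainer wants one sentence per line. The function names, the "mri" value passed to -lang and the file locations are illustrative assumptions.

```shell
#!/bin/bash
# Sketch only; not the contents of gen_SentenceDetection_model.sh.

# Hypothetical helper: drop the leading ID column (assumed "ID<TAB>sentence")
# and skip empty lines, leaving one sentence per line.
format_leipzig_sentences() {
    cut -f2- "$1" | sed '/^[[:space:]]*$/d'
}

# Hypothetical wrapper: train a sentence detector model from the formatted file.
# Requires OPENNLP_HOME to be set as in step 3 of the README.
train_sentence_model() {
    local train_file="$1" model_file="$2"
    "$OPENNLP_HOME/bin/opennlp" SentenceDetectorTrainer \
        -model "$model_file" -lang mri \
        -data "$train_file" -encoding UTF-8
}

# Example use (paths are assumptions):
#   format_leipzig_sentences mri_web_2011_100K-sentences.txt > mri-sent.train
#   train_sentence_model mri-sent.train mri-sent_trained.bin
```

Keeping the formatting step separate from the training step means the mri-sent.train file can be inspected before a model is built from it.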
     23
     24
     25 1. Once you've svn checked out the maori-lang-detection from gs3-extensions, create a folder called "models" and put the files langdetect-183.bin and mri-sent_trained.bin from the folder "models-trainingdata-and-sampletxts" into it.
     26(These are just zip files, but have to remain with the .bin extension in order for OpenNLP to use them. If you ever wish to see the contents of such a .bin file, you can rename to .zip and use the Archive Manager or other Zip tool to inspect the contents.)
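Step 1 could be carried out along these lines. A minimal sketch, assuming it is run against the top of the maori-lang-detection checkout; the function name setup_models_dir is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: copy the two model files into a new "models" folder.
# $1 = checkout directory (defaults to the current directory).
setup_models_dir() {
    local checkout="${1:-.}"
    local src="$checkout/models-trainingdata-and-sampletxts"
    mkdir -p "$checkout/models" || return 1
    # The files keep their .bin extension, as the README requires for OpenNLP.
    cp "$src/langdetect-183.bin" "$src/mri-sent_trained.bin" "$checkout/models/"
}

# To look inside a model without renaming it, a zip tool can read it directly:
#   unzip -l models/langdetect-183.bin
```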
     27
     28You can optionally put the file models-trainingdata-and-sampletxts/
     29
    1530
    16 31 2. Next extract the apache-opennlp-1.9.1-bin.tar.gz.
    1732This will create a folder called apache-opennlp-1.9.1. Move the "models" folder you created in step 1 into this folder.
     33
    1834
    19 35 3. Before you can compile or run the MaoriTextDetector program, you always have to prepare a terminal by setting up the environment for OpenNLP as follows:
     
    2541Still in the SAME terminal as where you set up the OPENNLP_HOME environment in step 3, you can now run:
    2642      maori-lang-detection/src$ javac -cp ".:$OPENNLP_HOME/lib/opennlp-tools-1.9.1.jar" MaoriTextDetector.java
     43
    2744
    28 45 5. To run the MaoriTextDetector program, you will need the JDK or JRE 7+ bin folder on your PATH and still in the SAME terminal as where you set up the OPENNLP_HOME environment in step 3,
     
    109126I got the binary version. I unzipped it.
    110127
     128
    111 129 2. Get the model, langdetect-183.bin, from: http://opennlp.apache.org/models.html
    112130    Direct download link: https://www.apache.org/dyn/closer.cgi/opennlp/models/langdetect/1.8.3/langdetect-183.bin
     
    116134- But for using with openNLP, don't rename or unzip it.
    117135
     136
    118 137 3. UNNECESSARY:
    119138I started by following the instructions at the bottom of:
     
    135154
    136155
    137     svn co https://svn.apache.org/repos/bigdata/opennlp/trunk --depth immediates
    138     mv trunk opennlp-corpus
    139     svn up --set-depth immediates
     156    svn co --depth immediates --trust-server-cert --non-interactive https://svn.apache.org/repos/bigdata/opennlp/trunk opennlp-corpus
     157    cd opennlp-corpus
     158    svn up --set-depth immediates --trust-server-cert --non-interactive
     159    cd leipzig
     160    svn up --set-depth immediates --trust-server-cert --non-interactive
    140161    cd resources/
    141     wharariki:[197]/Scratch/ak19/openNLP-lang-detect/opennlp-corpus/leipzig/resources>svn up --set-depth infinity
     162    wharariki:[197]/Scratch/ak19/openNLP-lang-detect/opennlp-corpus/leipzig/resources>svn up --set-depth infinity --trust-server-cert --non-interactive
    142163    cd ../data
    143     svn up mri_web_2011_100K-sentences.txt
     164    svn up --trust-server-cert --non-interactive mri_web_2011_100K-sentences.txt
     165   
     166    (# UNNECESSARY TO DOWNLOAD:
    144167    svn up eng_wikipedia_2012_3M-sentences.txt
    145168    svn up nld_mixed_2012_1M-sentences.txt
    146169    svn up fra_mixed_2009_1M-sentences.txt
     170    )
    147171
    148172    cd ..
     
    157181
    158182
    159 4. Attempting running OpenNLP + LanguageDetectorModel's command line tools for language prediction
     183 4. Attempting to run OpenNLP + LanguageDetectorModel's command line tools for language prediction
    160184
    161185I couldn't get prediction to work from the command line, so I ended up writing that part as a Java class. But I could run some part of the command line tools with the instructions on the following pages.
     
    184208
    185209
     210UPDATE 24 July 2019: The following worked
     211    >$OPENNLP_HOME/bin/opennlp LanguageDetector $OPENNLP_HOME/models/langdetect-183.bin < opennlp-corpus/leipzig/data/mri_web_2011_100K-sentences.txt
     212but it doesn't provide predictions. Not sure I understand what it did other than print the contents of the *sentences.txt file and end with:
     213        ...
     214        Average: 0.1 doc/s
     215        Total: 1 doc
     216        Runtime: 8.046s
     217        Execution time: 8.719 seconds
     218
     219
    186 220 5. For writing Java code:
    187221To write the basic code, I followed the Java skeleton examples at
     
    210244
    211245(Though possibly the only jar file needed in $OPENNLP_HOME/lib is opennlp-tools-1.9.1.jar)
     246
     247================
     248
     249UNORGANISED
     250
     251https://www.slideshare.net/shuyo/short-text-language-detection-with-infinitygram-12949447
     252http://alexott.blogspot.com/2017/10/evaluating-fasttexts-models-for.html
     253
     254http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.cli.sentdetect
     255
     256SentenceDetectorTrainer example
     257https://www.tutorialkart.com/opennlp/train-model-sentence-detection-java/
     258
     259https://stackoverflow.com/questions/36516363/sentence-detection-with-opennlp
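Going by the sentdetect section of the OpenNLP manual linked above, the trained Maori model could presumably be tried out from the command line as follows. A sketch, assuming OPENNLP_HOME is set as in step 3 and the "models" folder is in place as in steps 1 and 2; the function name detect_sentences is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: split a text file into sentences with the Maori model.
# $1 = input text file, $2 = model file (defaults to the trained Maori model).
detect_sentences() {
    local input="$1"
    local model="${2:-$OPENNLP_HOME/models/mri-sent_trained.bin}"
    if [ ! -f "$model" ]; then
        echo "Model not found: $model" >&2
        return 1
    fi
    # The SentenceDetector tool reads text on stdin and prints one sentence per line.
    "$OPENNLP_HOME/bin/opennlp" SentenceDetector "$model" < "$input"
}

# Example use:
#   detect_sentences models-trainingdata-and-sampletxts/sample_mri_paragraphs.txt
```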