Timestamp:
2019-07-23T17:29:18+12:00
Author:
ak19
Message:

Better comments. Tested macronised vs unmacronised Māori language test string and both are detected as mri, but the unmacronised is detected with lower confidence. Added a note on that in the README.
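
A comparison like the one described in the commit message needs an unmacronised copy of the test string. The following is a minimal sketch of one way to derive it, using only the Java standard library; the class name MacronStripper and method stripMacrons are hypothetical and are not part of MaoriTextDetector.java. It decomposes the text to Unicode NFD and drops all combining marks, so it would also remove diacritics other than macrons.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of the changeset: derives an unmacronised
// variant of a string so both forms can be fed to a language detector.
public class MacronStripper {

    /** Decomposes text to NFD, then removes every combining mark (macrons included). */
    public static String stripMacrons(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // Start of TEST_MRI_INPUT_TEXT; \u0113 is e-macron and \u0101 is a-macron,
        // written as escapes so the source compiles regardless of file encoding.
        String macronised = "Ko t\u0113nei te Whare W\u0101nanga o Waikato";
        System.out.println(stripMacrons(macronised));
    }
}
```

Both the original and the stripped string could then be passed to the detector to compare the reported confidence values.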

File:
1 edited

  • gs3-extensions/maori-lang-detection/src/MaoriTextDetector.java

    r33338 r33350  
    11/**
     2 * Class that uses OpenNLP with the Language Detection Model to determine, with a default
     3 * or configurable level of confidence, whether text (from a file or stdin) is in Māori or not.
     4 * Internal functions can be used for detecting any of the 103 languages currently supported by
     5 * the OpenNLP Language Detection Model.
     6 *
    27 * http://opennlp.apache.org/news/model-langdetect-183.html
    38 * language detector model: http://opennlp.apache.org/models.html
     
    813 *
    914 * This code was based on the information and sample code at the above links and the links dispersed throughout this file.
     15 * See also the accompanying README file.
     16 *
     17 * July 2019
    1018 */
    1119
     
    1624/**
    1725 * EXPORT OPENNLP_HOME environment variable to be your apache OpenNLP installation.
    18  * Then, to compile this program:
     26 * Create a folder called "models" within the $OPENNLP_HOME folder, and put the file "langdetect-183.bin" in there
     27 *    (which is the language detection model zipped up and renamed to .bin extension).
     28 *
     29 * Then, to compile this program, do the following from the "src" folder (the folder containing this java file):
    1930 *    maori-lang-detection/src$ javac -cp ".:$OPENNLP_HOME/lib/opennlp-tools-1.9.1.jar" MaoriTextDetector.java
    20  * To run this program, one of:
     31 *
     32 * To run this program, issue one of the following commands from the "src" folder (the folder containing this java file):
    2133 *
    2234 *    maori-lang-detection/src$ java -cp ".:$OPENNLP_HOME/lib/*" MaoriTextDetector --help
     
    2537 *
    2638 *    maori-lang-detection/src$ java -cp ".:$OPENNLP_HOME/lib/*" MaoriTextDetector -
    27  *       which expects text to stream in from standard input.
     39 *       Press enter. This variant of the program expects text to stream in from standard input.
    2840 *       If entering text manually, then remember to press Ctrl-D to indicate the usual end of StdIn.
    2941 *
     
    3951    greater or equal to which determines that the best predicted language is acceptable to user of MaoriTextDetector. */
    4052    public final double MINIMUM_CONFIDENCE;
     53   
    4154    /** silentMode set to false means MaoriTextDetector will print helpful messages while running. Set to true to run silently. */
    4255    public final boolean silentMode;
     
    4457    /** Language Detection Model file for OpenNLP is expected to be at $OPENNLP_HOME/models/langdetect-183.bin */
    4558    private final String LANG_DETECT_MODEL_RELATIVE_PATH = "models" + File.separator + "langdetect-183.bin";
     59
     60    /**
     61     * The LanguageDetector object that will do the actual language detection/prediction for us.
     62     * Created once in the constructor, can be used as often as needed thereafter.
     63    */
    4664    private LanguageDetector myCategorizer = null;
    4765   
    48     /**
    49      * String taken from our university website
    50      * https://www.waikato.ac.nz/maori/
    51      */
     66    /** String taken from our university website, https://www.waikato.ac.nz/maori/ */
    5267    public static final String TEST_MRI_INPUT_TEXT = "Ko tēnei te Whare Wānanga o Waikato e whakatau nei i ngā iwi o te ao, ki roto i te riu o te awa e rere nei, ki runga i te whenua e hora nei, ki raro i te taumaru o ngā maunga whakaruru e tau awhi nei.";
    5368
     
    224239    /**
    225240     * Prints to STDOUT the predicted languages of the input text in order of descending confidence.
    226      * Unused.
     241     * UNUSED.
    227242     */
    228243    public void predictedLanguages(String text) {
     
    367382        System.exit(returnVal);
    368383    }   
    369    
     384
     385
     386    // 2. Finally, we can now do the actual language detection
    370387    try {
    371388        MaoriTextDetector maoriTextDetector = null;