Changeset 33350


Timestamp:
2019-07-23T17:29:18+12:00 (4 years ago)
Author:
ak19
Message:

Better comments. Tested macronised vs unmacronised Māori language test string and both are detected as mri, but the unmacronised is detected with lower confidence. Added a note on that in the README.

Location:
gs3-extensions/maori-lang-detection
Files:
3 edited

  • gs3-extensions/maori-lang-detection/README.txt

    r33339 r33350  
    3838
    3939
    40 
    41 
    42 For reading materials, see the OLD README section below.
     40For links to background reading materials, see the OLD README section further below.
     41
     42
     43NOTE: The OpenNLP Language Detection Model can detect non-macronised Māori text too,
     44but as anticipated, the same text produces a lower confidence level for the language prediction. Compare:
     45
     46$maori-lang-detection/src>java -cp ".:$OPENNLP_HOME/lib/opennlp-tools-1.9.1.jar" MaoriTextDetector -
     47   Waiting to read text from STDIN... (press Ctrl-D when done entering text)>
     48   Ko tenei te Whare Wananga o Waikato e whakatau nei i nga iwi o te ao, ki roto i te riu o te awa e rere nei, ki runga i te whenua e hora nei, ki raro i te taumaru o nga maunga whakaruru e tau awhi nei.
     49   Best language: mri
     50   Best language confidence: 0.5959533972070814
     51   Exitting program with returnVal 0...
     52
     53$maori-lang-detection/src>java -cp ".:$OPENNLP_HOME/lib/opennlp-tools-1.9.1.jar" MaoriTextDetector -
     54   Waiting to read text from STDIN... (press Ctrl-D when done entering text)>
     55   Ko tēnei te Whare Wānanga o Waikato e whakatau nei i ngā iwi o te ao, ki roto i te riu o te awa e rere nei, ki runga i te whenua e hora nei, ki raro i te taumaru o ngā maunga whakaruru e tau awhi nei.
     56   Best language: mri
     57   Best language confidence: 0.6825737450092515
     58   Exitting program with returnVal 0...
     59
    4360
    4461-------------------------
  • gs3-extensions/maori-lang-detection/src/MaoriTextDetector.java

    r33338 r33350  
    11/**
     2 * Class that uses OpenNLP with the Language Detection Model to determine, with a default
     3 * or configurable level of confidence, whether text (from a file or stdin) is in Māori or not.
     4 * Internal functions can be used for detecting any of the 103 languages currently supported by
     5 * the OpenNLP Language Detection Model.
     6 *
    27 * http://opennlp.apache.org/news/model-langdetect-183.html
    38 * language detector model: http://opennlp.apache.org/models.html
     
    813 *
    914 * This code was based on the information and sample code at the above links and the links dispersed throughout this file.
     15 * See also the accompanying README file.
     16 *
     17 * July 2019
    1018 */
    1119
     
    1624/**
    1725 * EXPORT OPENNLP_HOME environment variable to be your apache OpenNLP installation.
    18  * Then, to compile this program:
     26 * Create a folder called "models" within the $OPENNLP_HOME folder, and put the file "langdetect-183.bin" in there
     27 *    (which is the language detection model zipped up and renamed to .bin extension).
     28 *
     29 * Then, to compile this program, do the following from the "src" folder (the folder containing this java file):
    1930 *    maori-lang-detection/src$ javac -cp ".:$OPENNLP_HOME/lib/opennlp-tools-1.9.1.jar" MaoriTextDetector.java
    20  * To run this program, one of:
     31 *
     32 * To run this program, issue one of the following commands from the "src" folder (the folder containing this java file):
    2133 *
    2234 *    maori-lang-detection/src$ java -cp ".:$OPENNLP_HOME/lib/*" MaoriTextDetector --help
     
    2537 *
    2638 *    maori-lang-detection/src$ java -cp ".:$OPENNLP_HOME/lib/*" MaoriTextDetector -
    27  *       which expects text to stream in from standard input.
     39 *       Press enter. This variant of the program expects text to stream in from standard input.
    2840 *       If entering text manually, then remember to press Ctrl-D to indicate the usual end of StdIn.
    2941 *
     
    3951    greater or equal to which determines that the best predicted language is acceptable to user of MaoriTextDetector. */
    4052    public final double MINIMUM_CONFIDENCE;
     53   
    4154    /** silentMode set to false means MaoriTextDetector will print helpful messages while running. Set to true to run silently. */
    4255    public final boolean silentMode;
     
    4457    /** Language Detection Model file for OpenNLP is expected to be at $OPENNLP_HOME/models/langdetect-183.bin */
    4558    private final String LANG_DETECT_MODEL_RELATIVE_PATH = "models" + File.separator + "langdetect-183.bin";
     59
     60    /**
     61     * The LanguageDetector object that will do the actual language detection/prediction for us.
     62     * Created once in the constructor, can be used as often as needed thereafter.
     63    */
    4664    private LanguageDetector myCategorizer = null;
    4765   
    48     /**
    49      * String taken from our university website
    50      * https://www.waikato.ac.nz/maori/
    51      */
     66    /** String taken from our university website, https://www.waikato.ac.nz/maori/ */
    5267    public static final String TEST_MRI_INPUT_TEXT = "Ko tēnei te Whare Wānanga o Waikato e whakatau nei i ngā iwi o te ao, ki roto i te riu o te awa e rere nei, ki runga i te whenua e hora nei, ki raro i te taumaru o ngā maunga whakaruru e tau awhi nei.";
    5368
     
    224239    /**
    225240     * Prints to STDOUT the predicted languages of the input text in order of descending confidence.
    226      * Unused.
     241     * UNUSED.
    227242     */
    228243    public void predictedLanguages(String text) {
     
    367382        System.exit(returnVal);
    368383    }   
    369    
     384
     385
     386    // 2. Finally, we can now do the actual language detection
    370387    try {
    371388        MaoriTextDetector maoriTextDetector = null;
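The README note in this changeset compares a macronised test string against its unmacronised form. As a minimal sketch of how such an unmacronised variant could be derived for testing, the snippet below strips macrons using only the Java standard library. The class name MacronStripper and the method stripMacrons are hypothetical and are not part of MaoriTextDetector; the changeset does not say how the unmacronised string was actually produced.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of this changeset: removes macrons from text
// so the same sentence can be tested in macronised and unmacronised form.
class MacronStripper {

    /** Returns the text with combining macrons (U+0304) removed. */
    static String stripMacrons(String text) {
        // NFD decomposition splits a macronised vowel into base letter + U+0304.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // Drop only the combining macron; other diacritics are left alone.
        String stripped = decomposed.replace("\u0304", "");
        // Recompose whatever other accented characters remain.
        return Normalizer.normalize(stripped, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // Start of the test sentence, with macronised e and a written as
        // unicode escapes so the file compiles under any source encoding.
        String macronised = "Ko t\u0113nei te Whare W\u0101nanga o Waikato";
        System.out.println(stripMacrons(macronised));
        // prints: Ko tenei te Whare Wananga o Waikato
    }
}
```

Both strings could then be passed to MaoriTextDetector via standard input, as in the README transcript, to compare the reported confidence values.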