Changeset 33350

Timestamp: 23.07.2019 17:29:18
Author: ak19
Message: Better comments. Tested macronised vs unmacronised Māori language test string and both are detected as mri, but the unmacronised is detected with lower confidence. Added a note on that in the README.
Location: gs3-extensions/maori-lang-detection
Files: 3 modified

  • gs3-extensions/maori-lang-detection/README.txt

--- gs3-extensions/maori-lang-detection/README.txt (r33339)
+++ gs3-extensions/maori-lang-detection/README.txt (r33350)
@@ -38,7 +38,24 @@
 
 
-
-
-For reading materials, see the OLD README section below.
+For links to background reading materials, see the OLD README section further below.
+
+
+NOTE: The OpenNLP Language Detection Model can detect non-macronised Māori text too,
+but as anticipated, the same text produces a lower confidence level for the language prediction. Compare:
+
+$maori-lang-detection/src>java -cp ".:$OPENNLP_HOME/lib/opennlp-tools-1.9.1.jar" MaoriTextDetector -
+   Waiting to read text from STDIN... (press Ctrl-D when done entering text)>
+   Ko tenei te Whare Wananga o Waikato e whakatau nei i nga iwi o te ao, ki roto i te riu o te awa e rere nei, ki runga i te whenua e hora nei, ki raro i te taumaru o nga maunga whakaruru e tau awhi nei.
+   Best language: mri
+   Best language confidence: 0.5959533972070814
+   Exitting program with returnVal 0...
+
+$maori-lang-detection/src>java -cp ".:$OPENNLP_HOME/lib/opennlp-tools-1.9.1.jar" MaoriTextDetector -
+   Waiting to read text from STDIN... (press Ctrl-D when done entering text)>
+   Ko tēnei te Whare Wānanga o Waikato e whakatau nei i ngā iwi o te ao, ki roto i te riu o te awa e rere nei, ki runga i te whenua e hora nei, ki raro i te taumaru o ngā maunga whakaruru e tau awhi nei.
+   Best language: mri
+   Best language confidence: 0.6825737450092515
+   Exitting program with returnVal 0...
+
 
 -------------------------
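
The unmacronised test string in the README note can be derived from the macronised one programmatically instead of being retyped. The following is a minimal sketch, not part of this changeset: the class name MacronStripper and its method are hypothetical, and it relies only on java.text.Normalizer from the standard library. It decomposes each macronised vowel into a base letter plus a combining mark and then drops the marks.

```java
import java.text.Normalizer;

/** Hypothetical helper (not in the changeset): derives unmacronised text from macronised text. */
public class MacronStripper {

    /**
     * Decomposes the text to NFD, so that a macronised vowel becomes
     * a base letter followed by the combining macron (U+0304),
     * then removes all combining marks.
     */
    public static String stripMacrons(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        // Start of the test string, with the macronised vowels written as
        // escapes so that the source file stays plain ASCII:
        // U+0113 is e with macron, U+0101 is a with macron.
        String macronised = "Ko t\u0113nei te Whare W\u0101nanga o Waikato";
        System.out.println(stripMacrons(macronised)); // prints "Ko tenei te Whare Wananga o Waikato"
    }
}
```

Both variants can then be fed to MaoriTextDetector to compare the reported confidence. Note that this removes every combining mark, not only macrons, which is acceptable for Māori text but may be too aggressive for other languages.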
  • gs3-extensions/maori-lang-detection/src/MaoriTextDetector.java

--- gs3-extensions/maori-lang-detection/src/MaoriTextDetector.java (r33338)
+++ gs3-extensions/maori-lang-detection/src/MaoriTextDetector.java (r33350)
@@ -1,3 +1,8 @@
 /**
+ * Class that uses OpenNLP with the Language Detection Model to determine, with a default
+ * or configurable level of confidence, whether text (from a file or stdin) is in Māori or not.
+ * Internal functions can be used for detecting any of the 103 languages currently supported by
+ * the OpenNLP Language Detection Model.
+ *
  * http://opennlp.apache.org/news/model-langdetect-183.html
  * language detector model: http://opennlp.apache.org/models.html
     
@@ -8,4 +13,7 @@
  *
  * This code was based on the information and sample code at the above links and the links dispersed throughout this file.
+ * See also the accompanying README file.
+ *
+ * July 2019
  */
 
     
@@ -16,7 +24,11 @@
 /**
  * EXPORT OPENNLP_HOME environment variable to be your apache OpenNLP installation.
- * Then, to compile this program:
+ * Create a folder called "models" within the $OPENNLP_HOME folder, and put the file "langdetect-183.bin" in there
+ *    (which is the language detection model zipped up and renamed to .bin extension).
+ *
+ * Then, to compile this program, do the following from the "src" folder (the folder containing this java file):
  *    maori-lang-detection/src$ javac -cp ".:$OPENNLP_HOME/lib/opennlp-tools-1.9.1.jar" MaoriTextDetector.java
- * To run this program, one of:
+ *
+ * To run this program, issue one of the following commands from the "src" folder (the folder containing this java file):
  *
  *    maori-lang-detection/src$ java -cp ".:$OPENNLP_HOME/lib/*" MaoriTextDetector --help
     
@@ -25,5 +37,5 @@
  *
  *    maori-lang-detection/src$ java -cp ".:$OPENNLP_HOME/lib/*" MaoriTextDetector -
- *       which expects text to stream in from standard input.
+ *       Press enter. This variant of the program expects text to stream in from standard input.
  *       If entering text manually, then remember to press Ctrl-D to indicate the usual end of StdIn.
  *
     
@@ -39,4 +51,5 @@
     greater or equal to which determines that the best predicted language is acceptable to user of MaoriTextDetector. */
     public final double MINIMUM_CONFIDENCE;
+
     /** silentMode set to false means MaoriTextDetector won't print helpful messages while running. Set to true to run silently. */
     public final boolean silentMode;
     
@@ -44,10 +57,12 @@
     /** Language Detection Model file for OpenNLP is expected to be at $OPENNLP_HOME/models/langdetect-183.bin */
     private final String LANG_DETECT_MODEL_RELATIVE_PATH = "models" + File.separator + "langdetect-183.bin";
+
+    /**
+     * The LanguageDetectorModel object that will do the actual language detection/prediction for us.
+     * Created once in the constructor, can be used as often as needed thereafter.
+    */
     private LanguageDetector myCategorizer = null;
 
-    /**
-     * String taken from our university website
-     * https://www.waikato.ac.nz/maori/
-     */
+    /** String taken from our university website, https://www.waikato.ac.nz/maori/ */
     public static final String TEST_MRI_INPUT_TEXT = "Ko tēnei te Whare Wānanga o Waikato e whakatau nei i ngā iwi o te ao, ki roto i te riu o te awa e rere nei, ki runga i te whenua e hora nei, ki raro i te taumaru o ngā maunga whakaruru e tau awhi nei.";
 
     
@@ -224,5 +239,5 @@
     /**
      * Prints to STDOUT the predicted languages of the input text in order of descending confidence.
-     * Unused.
+     * UNUSED.
      */
     public void predictedLanguages(String text) {
     
@@ -367,5 +382,7 @@
         System.exit(returnVal);
     }
-
+
+
+    // 2. Finally, we can now do the actual language detection
     try {
         MaoriTextDetector maoriTextDetector = null;
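
The class comment describes accepting the best predicted language only when its confidence is greater than or equal to MINIMUM_CONFIDENCE. The following sketch illustrates why that threshold matters for the macronised vs unmacronised comparison in the README note. It is not the code from MaoriTextDetector.java: the class ConfidenceCheck, the method isAcceptablyMaori and the threshold of 0.6 are all hypothetical, chosen only so that the two confidence values quoted in the README fall on opposite sides of it. The real default threshold is not shown in this changeset.

```java
/** Hypothetical illustration (not in the changeset) of a minimum-confidence check. */
public class ConfidenceCheck {

    /** Language code reported for Maori in the README output ("Best language: mri"). */
    public static final String MAORI_LANG_CODE = "mri";

    /**
     * Returns true only if the best predicted language is Maori and its
     * confidence is greater than or equal to the given minimum.
     */
    public static boolean isAcceptablyMaori(String bestLang, double confidence, double minimumConfidence) {
        return MAORI_LANG_CODE.equals(bestLang) && confidence >= minimumConfidence;
    }

    public static void main(String[] args) {
        // ASSUMPTION: 0.6 is an illustrative threshold, not the program's actual default.
        double hypotheticalMinimum = 0.6;

        // Confidence values copied from the README note.
        double macronisedConfidence = 0.6825737450092515;
        double unmacronisedConfidence = 0.5959533972070814;

        System.out.println(isAcceptablyMaori("mri", macronisedConfidence, hypotheticalMinimum));   // prints "true"
        System.out.println(isAcceptablyMaori("mri", unmacronisedConfidence, hypotheticalMinimum)); // prints "false"
    }
}
```

With a threshold between the two values, the same sentence would be accepted with macrons and rejected without them, so a configurable minimum confidence is worth lowering when the input is expected to be unmacronised.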