MaoriTextDetector.java

Timestamp:

2019-10-18T22:20:06+13:00 (5 years ago)

Author:

ak19

Message:

Refactored MaoriTextDetector.java class into more general TextLanguageDetector.java superclass and just the MRI-specific methods, constructors and member vars remaining in MaoriTextDetector.java. Easier to read code. Makes superclass reusable for other languages that need a similar treatment.
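
The split described in the message can be pictured roughly as follows. This is a minimal sketch, not the committed code: `Prediction` and `Predictor` are hypothetical stand-ins for OpenNLP's `Language` and `LanguageDetector`, sentence splitting uses a naive regex instead of a trained `SentenceDetectorME`, and the class names carry a `Sketch` suffix to mark them as illustrative.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for OpenNLP's Language: a language code plus a confidence. */
class Prediction {
    final String lang;
    final double confidence;

    Prediction(String lang, double confidence) {
        this.lang = lang;
        this.confidence = confidence;
    }
}

/** Hypothetical stand-in for OpenNLP's LanguageDetector. */
interface Predictor {
    Prediction predictLanguage(String text);
}

/** Language-agnostic base class: owns the detector and the confidence cutoff. */
class TextLanguageDetectorSketch {
    protected final Predictor predictor;
    protected final double minimumConfidence;

    TextLanguageDetectorSketch(Predictor predictor, double minimumConfidence) {
        this.predictor = predictor;
        this.minimumConfidence = minimumConfidence;
    }

    /** True if the best prediction is langCode with at least minimumConfidence. */
    boolean isTextInLanguage(String langCode, String text) {
        Prediction best = predictor.predictLanguage(text);
        return best.lang.equals(langCode) && best.confidence >= minimumConfidence;
    }

    /** Keeps only the sentences predicted to be in langCode at or above the cutoff. */
    List<String> getAllSentencesInLanguage(String langCode, String text, double confidenceCutoff) {
        List<String> matches = new ArrayList<>();
        // Assumption: naive split after '.', '!' or '?'; the real class uses a trained sentence model.
        for (String sentence : text.split("(?<=[.!?])\\s+")) {
            if (sentence.isEmpty()) {
                continue;
            }
            Prediction best = predictor.predictLanguage(sentence);
            if (best.lang.equals(langCode) && best.confidence >= confidenceCutoff) {
                matches.add(sentence);
            }
        }
        return matches;
    }
}

/** Language-specific subclass: only the MRI constants and convenience methods remain. */
class MaoriTextDetectorSketch extends TextLanguageDetectorSketch {
    static final String MAORI_3LETTER_CODE = "mri";
    static final double DEFAULT_MINIMUM_CONFIDENCE = 0.50;

    MaoriTextDetectorSketch(Predictor predictor) {
        super(predictor, DEFAULT_MINIMUM_CONFIDENCE);
    }

    boolean isTextInMaori(String text) {
        return isTextInLanguage(MAORI_3LETTER_CODE, text);
    }

    List<String> getAllSentencesInMaori(String text) {
        return getAllSentencesInLanguage(MAORI_3LETTER_CODE, text, minimumConfidence);
    }
}
```

With this shape, supporting another language would only need a new subclass that supplies its own language code and sentence model, which is the reuse the commit message points to.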

File:

1 edited

gs3-extensions/maori-lang-detection/src/org/greenstone/atea/MaoriTextDetector.java (modified) (8 diffs)


gs3-extensions/maori-lang-detection/src/org/greenstone/atea/MaoriTextDetector.java

-              r33585
+              r33586
  * Class that uses OpenNLP with the Language Detection Model to determine, with a default
  * or configurable level of confidence, whether text (from a file or stdin) is in Māori or not.
+ * Internal functions can be used for detecting any of the 103 languages currently supported by
+ * the OpenNLP Language Detection Model.
+ *
+ * http://opennlp.apache.org/news/model-langdetect-183.html
+ * language detector model: http://opennlp.apache.org/models.html
+ *        Pre-trained models for OpenNLP 1.5: http://opennlp.sourceforge.net/models-1.5/
+ * Use of Apache OpenNLP in general:
+ *   http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#intro.api
+ * Use of OpenNLP for language detection:
+ * http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.langdetect
+ *
+ * This code was based on the information and sample code at the above links and the links dispersed throughout this file.
+ * See also the accompanying README file.
+ *
- * July 2019
+ * July 2019.
+ *
+ * Oct 2019:
+ * - Uses a Sentence Model that we trained for Māori (see bin/script/gen_SentenceDetection_model.sh)
+ * for being able to split Māori language text into sentences.
+ * - Refactored into TextLanguageDetector as base class with this class now inheriting from it.
  */
 …
  * Create a folder called "models" within the $OPENNLP_HOME folder, and put the file "langdetect-183.bin" in there
  *    (which is the language detection model zipped up and renamed to .bin extension).
+ * Ensure that the mri-sent_trained.bin sentence model for Māori that we trained also lives
+ * in the "models" folder.
+ *
  * Then, to compile this program, do the following from the "src" folder (the folder containing this java file):
 …
  * Also has information on how to run this class if it's in a Java package.
  */
-public class MaoriTextDetector {
+public class MaoriTextDetector extends TextLanguageDetector {
     /** The 3 letter language code for Maori in ISO 639-2 or ISO 639-3 */
     public static final String MAORI_3LETTER_CODE = "mri";
+    public static final double DEFAULT_MINIMUM_CONFIDENCE = 0.50;
+    /** Configurable: cut off minimum confidence value,
+    greater or equal to which determines that the best predicted language is acceptable to user of MaoriTextDetector. */
+    public final double MINIMUM_CONFIDENCE;
+    /** silentMode set to false means MaoriTextDetector will print helpful messages while running. Set to true to run silently. */
+    public final boolean silentMode;
+    private final String OPENNLP_MODELS_RELATIVE_PATH = "models" + File.separator;
+    /** Language Detection Model file for OpenNLP is expected to be at $OPENNLP_HOME/models/langdetect-183.bin */
+    private final String LANG_DETECT_MODEL_RELATIVE_PATH = OPENNLP_MODELS_RELATIVE_PATH + "langdetect-183.bin";
+    /**
+     * The LanguageDetectorModel object that will do the actual language detection/prediction for us.
+     * Created once in the constructor, can be used as often as needed thereafter.
+    */
+    private LanguageDetector myCategorizer = null;
+    /**
+     * The Sentence Detection object that does the sentence splitting for the language
+     * the sentence model was trained for.
+     */
+    private SentenceDetectorME sentenceDetector = null;
     /** String taken from our university website, https://www.waikato.ac.nz/maori/ */
     public static final String TEST_MRI_INPUT_TEXT = "Ko tēnei te Whare Wānanga o Waikato e whakatau nei i ngā iwi o te ao, ki roto i te riu o te awa e rere nei, ki runga i te whenua e hora nei, ki raro i te taumaru o ngā maunga whakaruru e tau awhi nei.";
 …
     public static final String TEST_ENG_INPUT_TEXT = "The main program exits with -1 if an Exception occurred when attempting to detect the text's language";
+    /** Constructor with default confidence for language detection.
+     * Uses the trained Maori sentence model.
+     */
     public MaoriTextDetector(boolean silentMode) throws Exception {
+    this(silentMode, DEFAULT_MINIMUM_CONFIDENCE);
+    }
+    /** Constructor that uses the sentence Model we trained for Māori */
+    super(silentMode, DEFAULT_MINIMUM_CONFIDENCE, "mri-sent_trained.bin");
+    }
+    /** Constructor with configurable confidence level in language detection
+     * that uses the sentence Model we trained for Māori */
     public MaoriTextDetector(boolean silentMode, double min_confidence) throws Exception {
+    this(silentMode, min_confidence, "mri-sent_trained.bin");
+    }
+    /** More general constructor that can use sentence detector models for other languages */
+    public MaoriTextDetector(boolean silentMode, double min_confidence,
+                 String sentenceModelFileName) throws Exception
+    {
+    this.silentMode = silentMode;
+    this.MINIMUM_CONFIDENCE = min_confidence;
+    // 1. Check we can find the Language Detect Model file in the correct location (check that $OPENNLP_HOME/models/langdetect-183.bin exists);
+    String langDetectModelPath = System.getenv("OPENNLP_HOME");
+    if(System.getenv("OPENNLP_HOME") == null) {
+        throw new Exception("\n\t*** Environment variable OPENNLP_HOME must be set to your Apache OpenNLP installation folder.");
+    }
+    langDetectModelPath = langDetectModelPath + File.separator + LANG_DETECT_MODEL_RELATIVE_PATH;
+    File langDetectModelBinFile = new File(langDetectModelPath);
+    if(!langDetectModelBinFile.exists()) {
+        throw new Exception("\n\t*** " + langDetectModelBinFile.getPath() + " doesn't exist."
+                + "\n\t*** Ensure the $OPENNLP_HOME folder contains a 'models' folder"
+                + "\n\t*** with the model file 'langdetect-183.bin' in it.");
+    }
+    // 2. Set up our language detector Model and the Categorizer for language predictions based on the Model.
+    // http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#intro.api
+    // https://docs.oracle.com/javase/tutorial/essential/exceptions/tryResourceClose.html
+    try (InputStream modelIn = new FileInputStream(langDetectModelPath)) {
+        LanguageDetectorModel model = new LanguageDetectorModel(modelIn);
+        // http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.langdetect
+        this.myCategorizer = new LanguageDetectorME(model);
+    }/*catch(Exception e) {
+        e.printStackTrace();
+        }*/
+    // instantiating function should handle critical exceptions. Constructors shouldn't.
+    // 3. Set up our sentence model and SentenceDetector object
+    String sentenceModelPath = System.getenv("OPENNLP_HOME") + File.separator
+        + OPENNLP_MODELS_RELATIVE_PATH + sentenceModelFileName; // "mri-sent_trained.bin" default
+    File sentenceModelBinFile = new File(sentenceModelPath);
+    if(!sentenceModelBinFile.exists()) {
+        throw new Exception("\n\t*** " + sentenceModelBinFile.getPath() + " doesn't exist."
+                + "\n\t*** Ensure the $OPENNLP_HOME folder contains a 'models' folder"
+                + "\n\t*** with the model file "+sentenceModelFileName+" in it.");
+    }
+    try (InputStream modelIn = new FileInputStream(sentenceModelPath)) {
+        // https://www.tutorialspoint.com/opennlp/opennlp_sentence_detection.htm
+        SentenceModel sentenceModel = new SentenceModel(modelIn);
+        this.sentenceDetector = new SentenceDetectorME(sentenceModel);
+    } // instantiating function should handle this critical exception
+    }
+    /**
+     * In this class' constructor, need to have set up the Sentence Detection Model
+     * for the langCode passed in to this function in order for the output to make
+     * sense for that language.
+     */
+    public ArrayList<String> getAllSentencesInLanguage(String langCode, String text, double confidenceCutoff)
+    {
+    // we'll be storing just those sentences in text that are in the denoted language code
+    ArrayList<String> mriSentences = new ArrayList<String>();
+    // OpenNLP language detection works best with a minimum of 2 sentences
+    // See https://opennlp.apache.org/news/model-langdetect-183.html
+    // "It is important to note that this model is trained for and works well with
+    // longer texts that have at least 2 sentences or more from the same language."
+    // For evaluating single languages, I used a very small data set and found that
+    // if the primary language detected is MRI AND if the confidence is >= 0.1, the
+    // results appear reasonably to be in te reo Māori.
+    String[] sentences = sentenceDetector.sentDetect(text);
+    for(int i = 0; i < sentences.length; i++) {
+        String sentence = sentences[i];
+        //System.err.println(sentence);
+        Language bestLanguage = myCategorizer.predictLanguage(sentence);
+        double confidence = bestLanguage.getConfidence();
+        if(bestLanguage.getLang().equals(langCode) && confidence >= confidenceCutoff) {
+        System.err.println("Adding sentence: " + sentence + "\n");
+        mriSentences.add(sentence);
+        } else {
+        System.err.println("SKIPPING sentence: " + sentence + "\n");
+        }
+    }
+    return mriSentences;
+    }
+    super(silentMode, min_confidence, "mri-sent_trained.bin");
+    }
+    /**
+     * Function that takes a text and returns those sentences in Māori.
+     * @param text: the string of text from which sentences in the requested
+     * language are to be identified and returned.
+     * @return an ArrayList of sentences in the text parameter that are
+     * in the requested language.
+     */
     public ArrayList<String> getAllSentencesInMaori(String text) throws Exception {
     // big assumption here: that we can split incoming text into sentences
 …
+    }
-    /** @param langCode is 3 letter language code, ISO 639-2/3
-     * https://www.loc.gov/standards/iso639-2/php/code_list.php
-     * https://en.wikipedia.org/wiki/ISO_639-3
-     * @return true if the input text is Maori (mri) with MINIMUM_CONFIDENCE levels of confidence (if set,
-     * else DEFAULT_MINIMUM_CONFIDENCE levels of confidence).
-     */
-    public boolean isTextInLanguage(String langCode, String text) {
-    // Get the most probable language
-    Language bestLanguage = myCategorizer.predictLanguage(text);
-    doPrint("Best language: " + bestLanguage.getLang());
-    doPrint("Best language confidence: " + bestLanguage.getConfidence());
-    return (bestLanguage.getLang().equals(langCode) && bestLanguage.getConfidence() >= this.MINIMUM_CONFIDENCE);
+    }
     /**
 …
     public boolean isTextInMaori(BufferedReader reader) throws Exception {
     return isTextInLanguage(MAORI_3LETTER_CODE, reader);
+    }
-    /**
-     * Handle "smaller" textfiles/streams of text read in.
-     * Return value is the same as for isTextInLanguage(String langCode, String text);
-     */
-    public boolean isTextInLanguage(String langCode, BufferedReader reader) throws Exception {
-    // https://stackoverflow.com/questions/326390/how-do-i-create-a-java-string-from-the-contents-of-a-file
-    StringBuilder text = new StringBuilder();
-    String line = null;
-    while((line = reader.readLine()) != null) { // readLine removes newline separator
-        text.append(line + "\n"); // add back (unix style) line ending
+    }
-    return isTextInLanguage(langCode, text.toString());
+    }
 …
+    }
-    /**
-     * Rudimentary attempt to deal with very large files.
-     * Return value is the same as for isTextInLanguage(String langCode, String text);
-     */
-    public boolean isLargeTextInLanguage(String langCode, BufferedReader reader) throws Exception {
-    // https://stackoverflow.com/questions/326390/how-do-i-create-a-java-string-from-the-contents-of-a-file
-    final int NUM_LINES = 100; // arbitrary 100 lines read, predict language, calculate confidence
-    StringBuilder text = new StringBuilder();
-    String line = null;
-    double cumulativeConfidence = 0;
-    int numLoops = 0;
-    int i = 0;
-    String language = null;
-    while((line = reader.readLine()) != null) { // readLine removes newline separator
-        text.append(line + "\n"); // add back (unix style) line ending
-        i++; // read nth line of numLoop
-        if(i == NUM_LINES) { // arbitrary 100 lines read, predict language, calculate confidence
-        Language bestLanguage = myCategorizer.predictLanguage(text.toString());
-        if(language != null && !bestLanguage.getLang().equals(language)) { // predicted lang of current n lines not the same as predicted lang for prev n lines
-            doPrintErr("**** WARNING: text seems to contain content in multiple languages or unable to consistently predict the same language.");
+        }
-        language = bestLanguage.getLang();
-        cumulativeConfidence += bestLanguage.getConfidence();
-        doPrintErr("Best predicted language for last " + NUM_LINES + " lines: " + language + "(confidence: " + bestLanguage.getConfidence() + ")");
-        // finished analysing language of NUM_LINES of text
-        text = new StringBuilder();
-        i = 0;
-        numLoops++;
+        }
+    }
-    // process any (remaining) text that was less than n NUM_LINES
-    if(!text.toString().equals("")) {
-        text.append(line + "\n"); // add back (unix style) line ending
-        i++;
-        Language bestLanguage = myCategorizer.predictLanguage(text.toString());
-        if(language != null && !bestLanguage.getLang().equals(language)) { // predicted lang of current n lines not the same as predicted lang for prev n lines
-        doPrintErr("**** WARNING: text seems to contain content in multiple languages or unable to consistently predict the same language.");
+        }
-        language = bestLanguage.getLang();
-        cumulativeConfidence += bestLanguage.getConfidence();
-        doPrintErr("Best predicted language for final " + NUM_LINES + " lines: " + language + "(confidence: " + bestLanguage.getConfidence() + ")");
+    }
-    int totalLinesRead = numLoops * NUM_LINES + i; // not used
-    double avgConfidence = cumulativeConfidence/(numLoops + 1); // not quite the average as the text processed outside the loop may have fewer lines than NUM_LINES
-    return (language.equals(langCode) && avgConfidence >= this.MINIMUM_CONFIDENCE);
+    }
-    /**
-     * Prints to STDOUT the predicted languages of the input text in order of descending confidence.
-     * UNUSED.
-     */
-    public void predictedLanguages(String text) {
-    // Get an array with the most probable languages
-    Language[] languages = myCategorizer.predictLanguages(text);
-    if(languages == null || languages.length <= 0) {
-        doPrintErr("No languages predicted for the input text");
-    } else {
-        for(int i = 0; i < languages.length; i++) {
-        doPrint("Language prediction " + i + ": " + languages[i]);
+        }
+    }
+    }
-    public void doPrint(String msg) {
-    doPrint(this.silentMode, msg);
+    }
-    public void doPrintErr(String msg) {
-    doPrintErr(this.silentMode, msg);
+    }
     /********** STATIC METHODS *************/
 …
+        }
+        // TODO
+        maoriTextDetector.getAllSentencesInMaori(
+                            "Primary sources ~ Published Maramataka Mo Te Tau 1885, Nepia: Te Haaringi, Kai-ta Pukapuka, kei Hehitingi Tiriti, 1884. Maramataka Mo Te Tau 1886, Nepia: Na te Haaringi i ta ki tona Whare Perehi Pukapuka, 1885. Maramataka Mo Te Tau 1887, Nepia: Na te Haaringi i ta ki tona Whare Perehi Pukapuka, 1886. Maramataka Mo Te Tau 1888, Nepia: Na te Haaringi i ta ki tona Whare Perehi Pukapuka, 1887. Maramataka Mo Te Tau 1889, Nepia: Na te Haaringi i ta ki tona Whare Perehi Pukapuka, 1888. Maramataka Mo Te Tau 1890, Nepia: Na te Haaringi i ta ki tona Whare Perehi Pukapuka, 1889. Maramataka Mo Te Tau 1891, Kihipane: Na te Muri i ta ki tona Whare Perehi Pukapuka, 1890. Maramataka Mo Te Tau 1892, Nepia: Na te Haaringi, i ta ki tona Whare Perehi Pukapuka, 1891. Maramataka Mo Te Tau 1893, Kihipane: Na te Muri i ta ki tona Whare Perehi Pukapuka, 1892. Maramataka Mo Te Tau 1894, Kihipane: Na te Muri i ta ki tona Whare Perehi Pukapuka, 1893. Maramataka Me Te Tau 1895, Kihipane: Na te Muri i Ta ki tona whare perehi pukapuka, 1894. Maramataka Mo Te Tau 1896, Kihipane: Na te Muri i ta ki tona Whare Perehi Pukapuka, 1895. Maramataka Mo Te Tau 1897, Kihipane: Na te Muri i ta ki tona Whare Perehi Pukapuka 1896. Maramataka Mo Te Tau 1898, Turanga: Na te Wiremu Hapata i ta ki Te Rau Kahikatea, 1897. Ko Te Paipera Tapu Ara, Ko Te Kawenata Tawhito Me Te Kawenata Hou, He Mea Whakamaori Mai No Nga Reo I Oroko-Tuhituhia Ai, Ranana: He mea ta ki te perehi a W.M.Watts ma te Komiti Ta Paipera mo Ingarangi mo Te Ao Katoa, 1868. Ko Te Pukapuka O Nga Inoinga, Me Era Atu Tikanga, I Whakaritea E Te Hahi O Ingarani, Mo Te Minitatanga O Nga Hakarameta, O Era Atu Ritenga a Te Hahi: Me Nga Waiata Ano Hoki a Rawiri, Me Te Tikanga Mo Te Whiriwhiringa, Mo Te Whakaturanga, Me Te Whakatapunga O Nga Pihopa, O Nga Piriti, Me Nga Rikona, Me Nga Himene, Ranana: I taia tenei ki te perehi o te Komiti mo te whakapuaki i to mohiotanga ki a te Karaiti, 1858. 
Ko Te Pukapuka O Nga Inoinga, Me Era Atu Tikanga, I Whakaritea E Te Hahi O Ingarani, Mo Te Minitatanga O Nga Hakarameta, O Era Atu Ritenga a Te Hahi: Me Nga Waiata Ano Hoki a Rawiri, Me Te Tikanga Mo Te Whiriwhiringa, Mo Te Whakaturanga, Me Te Whakatapunga O Nga Pihopa, O Nga Piriti, Me Nga Rikona. 1883. The Book of Common Prayer, and Administration of the Sacraments, and Other Rites and Ceremonies of the Church, According to the Use of the United Church of England and Ireland: Together with the Proper Lessons for Sundays and Other Holy-Days, and a New Version of the Psalms of David, Oxford: Printed at 134 the University Press, 1852. The Book of Common Prayer and Administration of the Sacraments, and Other Rites and Ceremonies of the Church, According to the Church of England: Together with the Psalter or Psalms of David, Printed as They Are to Be Sung or Said in Churches: And the Form and Manner of Making, Ordaining, and Consecrating of Bishops, Priests, and Deacons, London: G.E. Eyre and W. Spottiswoode, after 1871 but before 1877. Brown, A.N., The Journals of A.N. Brown C.M.S. Missionary Tauranga Covering the Years 1840 to 1842, Tauranga: The Elms Trust, 1990 (Commemorative Edition). ______________, Select Sermons of A.N. Brown, Tauranga: The Elms Trust, 1997. Fitzgerald, Caroline (ed.), Te Wiremu Henry Williams: Early Years in the North, Wellington: Huia Publishers, 2011. The Hawke's Bay Almanac, Napier: James Wood, Hawke's Bay Herald, 1862, 1863, 1867.");
+        //maoriTextDetector.getAllSentencesInMaori();
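
The `isLargeTextInLanguage` method removed from this file predicts a language per block of 100 lines and averages the confidences, and its own comment notes the result is "not quite the average" because the final block may be shorter. One possible way to address that, sketched here under assumptions, is to weight each block's confidence by the number of lines it holds. `Guess` and `Guesser` are hypothetical stand-ins for OpenNLP's `Language` and `LanguageDetector`, and unlike the original, which only warns on mixed predictions, this sketch requires every block to match the requested language. It is not the code that moved to `TextLanguageDetector`.

```java
import java.io.BufferedReader;
import java.util.Iterator;

class ChunkedDetectionSketch {
    /** Hypothetical stand-in for OpenNLP's Language. */
    static class Guess {
        final String lang;
        final double confidence;

        Guess(String lang, double confidence) {
            this.lang = lang;
            this.confidence = confidence;
        }
    }

    /** Hypothetical stand-in for OpenNLP's LanguageDetector. */
    interface Guesser {
        Guess predictLanguage(String text);
    }

    /**
     * Reads the text in blocks of linesPerChunk lines, predicts a language for each block,
     * and compares the line-weighted mean confidence against minConfidence.
     */
    static boolean isLargeTextInLanguage(Guesser guesser, String langCode, BufferedReader reader,
                                         int linesPerChunk, double minConfidence) {
        StringBuilder chunk = new StringBuilder();
        int linesInChunk = 0;
        int totalLines = 0;
        double weightedConfidence = 0;
        boolean allChunksMatch = true;

        Iterator<String> lines = reader.lines().iterator();
        while (lines.hasNext()) {
            chunk.append(lines.next()).append('\n'); // restore the line ending dropped by the reader
            linesInChunk++;
            // Evaluate when the block is full or the input has run out.
            if (linesInChunk == linesPerChunk || !lines.hasNext()) {
                Guess best = guesser.predictLanguage(chunk.toString());
                allChunksMatch &= best.lang.equals(langCode);
                weightedConfidence += best.confidence * linesInChunk;
                totalLines += linesInChunk;
                chunk.setLength(0);
                linesInChunk = 0;
            }
        }
        if (totalLines == 0) {
            return false; // empty input: nothing to judge
        }
        return allChunksMatch && weightedConfidence / totalLines >= minConfidence;
    }
}
```

Weighting by line count keeps a short trailing block from counting as much as a full one; whether a mixed-language text should fail outright or merely warn is a design choice left open here.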


Changeset 33586 for gs3-extensions/maori-lang-detection/src/org/greenstone/atea/MaoriTextDetector.java
