Changeset 33586 for gs3-extensions/maori-lang-detection/src/org/greenstone/atea/MaoriTextDetector.java
- Timestamp:
- 2019-10-18T22:20:06+13:00 (5 years ago)
- File:
-
- 1 edited
Legend:
- Unmodified
- Added
- Removed
-
gs3-extensions/maori-lang-detection/src/org/greenstone/atea/MaoriTextDetector.java
r33585 r33586 2 2 * Class that uses OpenNLP with the Language Detection Model to determine, with a default 3 3 * or configurable level of confidence, whether text (from a file or stdin) is in MÄori or not. 4 * Internal functions can be used for detecting any of the 103 languages currently supported by 5 * the OpenNLP Language Detection Model. 6 * 7 * http://opennlp.apache.org/news/model-langdetect-183.html 8 * language detector model: http://opennlp.apache.org/models.html 9 * Pre-trained models for OpenNLP 1.5: http://opennlp.sourceforge.net/models-1.5/ 10 * Use of Apache OpenNLP in general: 11 * http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#intro.api 12 * Use of OpenNLP for language detection: 13 * http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.langdetect 14 * 15 * This code was based on the information and sample code at the above links and the links dispersed throughout this file. 16 * See also the accompanying README file. 17 * 18 * July 2019 4 * July 2019. 5 * 6 * Oct 2019: 7 * - Uses a Sentence Model that we trained for MÄori (see bin/script/gen_SentenceDetection_model.sh) 8 * for being able to split MÄori language text into sentences. 9 * - Refactored into TextLanguageDetector as base class with this class now inheriting from it. 19 10 */ 20 11 … … 32 23 * Create a folder called "models" within the $OPENNLP_HOME folder, and put the file "langdetect-183.bin" in there 33 24 * (which is the language detection model zipped up and renamed to .bin extension). 25 * Ensure that the mri-sent_trained.bin sentence model for MÄori that we trained also lives 26 * in the "models" folder. 34 27 * 35 28 * Then, to compile this program, do the following from the "src" folder (the folder containing this java file): … … 49 42 * Also has information on how to run this class if it's in a Java package. 50 43 */ 51 public class MaoriTextDetector {44 public class MaoriTextDetector extends TextLanguageDetector { 52 45 /** The 3 letter language code for Maori in ISO 639-2 or ISO 639-3 */ 53 46 public static final String MAORI_3LETTER_CODE = "mri"; 54 public static final double DEFAULT_MINIMUM_CONFIDENCE = 0.50; 55 56 /** Configurable: cut off minimum confidence value, 57 greater or equal to which determines that the best predicted language is acceptable to user of MaoriTextDetector. */ 58 public final double MINIMUM_CONFIDENCE; 59 60 /** silentMode set to false means MaoriTextDetector won't print helpful messages while running. Set to true to run silently. */ 61 public final boolean silentMode; 62 63 private final String OPENNLP_MODELS_RELATIVE_PATH = "models" + File.separator; 64 65 /** Language Detection Model file for OpenNLP is expected to be at $OPENNLP_HOME/models/langdetect-183.bin */ 66 private final String LANG_DETECT_MODEL_RELATIVE_PATH = OPENNLP_MODELS_RELATIVE_PATH + "langdetect-183.bin"; 67 68 /** 69 * The LanguageDetectorModel object that will do the actual language detection/prediction for us. 70 * Created once in the constructor, can be used as often as needed thereafter. 71 */ 72 private LanguageDetector myCategorizer = null; 73 74 /** 75 * The Sentence Detection object that does the sentence splitting for the language 76 * the sentece model was trained for. 77 */ 78 private SentenceDetectorME sentenceDetector = null; 79 47 80 48 /** String taken from our university website, https://www.waikato.ac.nz/maori/ */ 81 49 public static final String TEST_MRI_INPUT_TEXT = "Ko tÄnei te Whare WÄnanga o Waikato e whakatau nei i ngÄ iwi o te ao, ki roto i te riu o te awa e rere nei, ki runga i te whenua e hora nei, ki raro i te taumaru o ngÄ maunga whakaruru e tau awhi nei."; … … 84 52 public static final String TEST_ENG_INPUT_TEXT = "The main program exits with -1 if an Exception occurred when attempting to detect the text's language"; 85 53 86 54 /** Constructor with default confidence for language detection. 55 * Uses the trained Maori sentence model. 56 */ 87 57 public MaoriTextDetector(boolean silentMode) throws Exception { 88 this(silentMode, DEFAULT_MINIMUM_CONFIDENCE); 89 } 90 91 /** Constructor that uses the sentence Model we trained for MÄori */ 58 super(silentMode, DEFAULT_MINIMUM_CONFIDENCE, "mri-sent_trained.bin"); 59 } 60 61 /** Constructor with configurable confidence level in language detection 62 * that uses the sentence Model we trained for MÄori */ 92 63 public MaoriTextDetector(boolean silentMode, double min_confidence) throws Exception { 93 this(silentMode, min_confidence, "mri-sent_trained.bin"); 94 } 95 96 /** More general constructor that can use sentence detector models for other languages */ 97 public MaoriTextDetector(boolean silentMode, double min_confidence, 98 String sentenceModelFileName) throws Exception 99 { 100 this.silentMode = silentMode; 101 this.MINIMUM_CONFIDENCE = min_confidence; 102 103 // 1. Check we can find the Language Detect Model file in the correct location (check that $OPENNLP_HOME/models/langdetect-183.bin exists); 104 String langDetectModelPath = System.getenv("OPENNLP_HOME"); 105 if(System.getenv("OPENNLP_HOME") == null) { 106 throw new Exception("\n\t*** Environment variable OPENNLP_HOME must be set to your Apache OpenNLP installation folder."); 107 } 108 langDetectModelPath = langDetectModelPath + File.separator + LANG_DETECT_MODEL_RELATIVE_PATH; 109 File langDetectModelBinFile = new File(langDetectModelPath); 110 if(!langDetectModelBinFile.exists()) { 111 throw new Exception("\n\t*** " + langDetectModelBinFile.getPath() + " doesn't exist." 112 + "\n\t*** Ensure the $OPENNLP_HOME folder contains a 'models' folder" 113 + "\n\t*** with the model file 'langdetect-183.bin' in it."); 114 } 115 116 117 // 2. Set up our language detector Model and the Categorizer for language predictions based on the Model. 118 // http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#intro.api 119 // https://docs.oracle.com/javase/tutorial/essential/exceptions/tryResourceClose.html 120 try (InputStream modelIn = new FileInputStream(langDetectModelPath)) { 121 122 LanguageDetectorModel model = new LanguageDetectorModel(modelIn); 123 124 // http://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.langdetect 125 this.myCategorizer = new LanguageDetectorME(model); 126 }/*catch(Exception e) { 127 e.printStackTrace(); 128 }*/ 129 130 // instantiating function should handle critical exceptions. Constructors shouldn't. 131 132 133 134 // 3. Set up our sentence model and SentenceDetector object 135 String sentenceModelPath = System.getenv("OPENNLP_HOME") + File.separator 136 + OPENNLP_MODELS_RELATIVE_PATH + sentenceModelFileName; // "mri-sent_trained.bin" default 137 File sentenceModelBinFile = new File(sentenceModelPath); 138 if(!sentenceModelBinFile.exists()) { 139 throw new Exception("\n\t*** " + sentenceModelBinFile.getPath() + " doesn't exist." 140 + "\n\t*** Ensure the $OPENNLP_HOME folder contains a 'models' folder" 141 + "\n\t*** with the model file "+sentenceModelFileName+" in it."); 142 } 143 try (InputStream modelIn = new FileInputStream(sentenceModelPath)) { 144 // https://www.tutorialspoint.com/opennlp/opennlp_sentence_detection.htm 145 SentenceModel sentenceModel = new SentenceModel(modelIn); 146 this.sentenceDetector = new SentenceDetectorME(sentenceModel); 147 148 } // instantiating function should handle this critical exception 149 } 150 151 /** 152 * In this class' constructor, need to have set up the Sentence Detection Model 153 * for the langCode passed in to this function in order for the output to make 154 * sense for that language. 155 */ 156 public ArrayList<String> getAllSentencesInLanguage(String langCode, String text, double confidenceCutoff) 157 { 158 159 // we'll be storing just those sentences in text that are in the denoted language code 160 ArrayList<String> mriSentences = new ArrayList<String>(); 161 // OpenNLP language detection works best with a minimum of 2 sentences 162 // See https://opennlp.apache.org/news/model-langdetect-183.html 163 // "It is important to note that this model is trained for and works well with 164 // longer texts that have at least 2 sentences or more from the same language." 165 166 // For evaluating single languages, I used a very small data set and found that 167 // if the primary language detected is MRI AND if the confidence is >= 0.1, the 168 // results appear reasonably to be in te reo MÄori. 169 170 String[] sentences = sentenceDetector.sentDetect(text); 171 172 for(int i = 0; i < sentences.length; i++) { 173 String sentence = sentences[i]; 174 175 //System.err.println(sentence); 176 177 Language bestLanguage = myCategorizer.predictLanguage(sentence); 178 double confidence = bestLanguage.getConfidence(); 179 180 if(bestLanguage.getLang().equals(langCode) && confidence >= confidenceCutoff) { 181 System.err.println("Adding sentence: " + sentence + "\n"); 182 mriSentences.add(sentence); 183 } else { 184 System.err.println("SKIPPING sentence: " + sentence + "\n"); 185 } 186 } 187 return mriSentences; 188 } 189 190 64 super(silentMode, min_confidence, "mri-sent_trained.bin"); 65 } 66 67 /** 68 * Function that takes a text and returns those sentences in MÄori. 69 * @param text: the string of text from which sentences in the requested 70 * language are to be identified and returned. 71 * @return an ArrayList of sentences in the text parameter that are 72 * in the requested language. 73 */ 191 74 public ArrayList<String> getAllSentencesInMaori(String text) throws Exception { 192 75 // big assumption here: that we can split incoming text into sentences … … 220 103 } 221 104 222 /** @param langCode is 3 letter language code, ISO 639-2/3223 * https://www.loc.gov/standards/iso639-2/php/code_list.php224 * https://en.wikipedia.org/wiki/ISO_639-3225 * @return true if the input text is Maori (mri) with MINIMUM_CONFIDENCE levels of confidence (if set,226 * else DEFAULT_MINIMUM_CONFIDENCE levels of confidence).227 */228 public boolean isTextInLanguage(String langCode, String text) {229 // Get the most probable language230 Language bestLanguage = myCategorizer.predictLanguage(text);231 doPrint("Best language: " + bestLanguage.getLang());232 doPrint("Best language confidence: " + bestLanguage.getConfidence());233 234 return (bestLanguage.getLang().equals(langCode) && bestLanguage.getConfidence() >= this.MINIMUM_CONFIDENCE);235 }236 237 105 238 106 /** … … 242 110 public boolean isTextInMaori(BufferedReader reader) throws Exception { 243 111 return isTextInLanguage(MAORI_3LETTER_CODE, reader); 244 }245 /**246 * Handle "smaller" textfiles/streams of text read in.247 * Return value is the same as for isTextInLanguage(String langCode, String text);248 */249 public boolean isTextInLanguage(String langCode, BufferedReader reader) throws Exception {250 // https://stackoverflow.com/questions/326390/how-do-i-create-a-java-string-from-the-contents-of-a-file251 252 StringBuilder text = new StringBuilder();253 String line = null;254 255 256 while((line = reader.readLine()) != null) { // readLine removes newline separator257 text.append(line + "\n"); // add back (unix style) line ending258 }259 return isTextInLanguage(langCode, text.toString());260 112 } 261 113 … … 274 126 } 275 127 276 /**277 * Rudimentary attempt to deal with very large files.278 * Return value is the same as for isTextInLanguage(String langCode, String text);279 */280 public boolean isLargeTextInLanguage(String langCode, BufferedReader reader) throws Exception {281 // https://stackoverflow.com/questions/326390/how-do-i-create-a-java-string-from-the-contents-of-a-file282 283 final int NUM_LINES = 100; // arbitrary 100 lines read, predict language, calculate confidence284 285 StringBuilder text = new StringBuilder();286 String line = null;287 288 double cumulativeConfidence = 0;289 int numLoops = 0;290 291 int i = 0;292 String language = null;293 294 while((line = reader.readLine()) != null) { // readLine removes newline separator295 text.append(line + "\n"); // add back (unix style) line ending296 297 i++; // read nth line of numLoop298 299 300 if(i == NUM_LINES) { // arbitrary 100 lines read, predict language, calculate confidence301 302 303 Language bestLanguage = myCategorizer.predictLanguage(text.toString());304 if(language != null && !bestLanguage.getLang().equals(language)) { // predicted lang of current n lines not the same as predicted lang for prev n lines305 doPrintErr("**** WARNING: text seems to contain content in multiple languages or unable to consistently predict the same language.");306 }307 language = bestLanguage.getLang();308 cumulativeConfidence += bestLanguage.getConfidence();309 310 doPrintErr("Best predicted language for last " + NUM_LINES + " lines: " + language + "(confidence: " + bestLanguage.getConfidence() + ")");311 312 // finished analysing language of NUM_LINES of text313 text = new StringBuilder();314 i = 0;315 numLoops++;316 }317 }318 319 // process any (remaining) text that was less than n NUM_LINES320 if(!text.toString().equals("")) {321 text.append(line + "\n"); // add back (unix style) line ending322 i++;323 324 Language bestLanguage = myCategorizer.predictLanguage(text.toString());325 326 if(language != null && !bestLanguage.getLang().equals(language)) { // predicted lang of current n lines not the same as predicted lang for prev n lines327 doPrintErr("**** WARNING: text seems to contain content in multiple languages or unable to consistently predict the same language.");328 }329 language = bestLanguage.getLang();330 cumulativeConfidence += bestLanguage.getConfidence();331 doPrintErr("Best predicted language for final " + NUM_LINES + " lines: " + language + "(confidence: " + bestLanguage.getConfidence() + ")");332 }333 334 335 int totalLinesRead = numLoops * NUM_LINES + i; // not used336 double avgConfidence = cumulativeConfidence/(numLoops + 1); // not quite the average as the text processed outside the loop may have fewer lines than NUM_LINES337 338 339 return (language.equals(langCode) && avgConfidence >= this.MINIMUM_CONFIDENCE);340 }341 342 343 /**344 * Prints to STDOUT the predicted languages of the input text in order of descending confidence.345 * UNUSED.346 */347 public void predictedLanguages(String text) {348 // Get an array with the most probable languages349 350 Language[] languages = myCategorizer.predictLanguages(text);351 352 if(languages == null || languages.length <= 0) {353 doPrintErr("No languages predicted for the input text");354 } else {355 for(int i = 0; i < languages.length; i++) {356 doPrint("Language prediction " + i + ": " + languages[i]);357 }358 }359 360 }361 362 public void doPrint(String msg) {363 doPrint(this.silentMode, msg);364 }365 public void doPrintErr(String msg) {366 doPrintErr(this.silentMode, msg);367 }368 128 369 129 /********** STATIC METHODS *************/ … … 499 259 } 500 260 501 // TODO 502 maoriTextDetector.getAllSentencesInMaori( 503 "Primary sources ~ Published Maramataka Mo Te Tau 1885, Nepia: Te Haaringi, Kai-ta Pukapuka, kei Hehitingi Tiriti, 1884. Maramataka Mo Te Tau 1886, Nepia: Na te Haaringi i ta ki tona Whare Perehi Pukapuka, 1885. Maramataka Mo Te Tau 1887, Nepia: Na te Haaringi i ta ki tona Whare Perehi Pukapuka, 1886. Maramataka Mo Te Tau 1888, Nepia: Na te Haaringi i ta ki tona Whare Perehi Pukapuka, 1887. Maramataka Mo Te Tau 1889, Nepia: Na te Haaringi i ta ki tona Whare Perehi Pukapuka, 1888. Maramataka Mo Te Tau 1890, Nepia: Na te Haaringi i ta ki tona Whare Perehi Pukapuka, 1889. Maramataka Mo Te Tau 1891, Kihipane: Na te Muri i ta ki tona Whare Perehi Pukapuka, 1890. Maramataka Mo Te Tau 1892, Nepia: Na te Haaringi, i ta ki tona Whare Perehi Pukapuka, 1891. Maramataka Mo Te Tau 1893, Kihipane: Na te Muri i ta ki tona Whare Perehi Pukapuka, 1892. Maramataka Mo Te Tau 1894, Kihipane: Na te Muri i ta ki tona Whare Perehi Pukapuka, 1893. Maramataka Me Te Tau 1895, Kihipane: Na te Muri i Ta ki tona whare perehi pukapuka, 1894. Maramataka Mo Te Tau 1896, Kihipane: Na te Muri i ta ki tona Whare Perehi Pukapuka, 1895. Maramataka Mo Te Tau 1897, Kihipane: Na te Muri i ta ki tona Whare Perehi Pukapuka 1896. Maramataka Mo Te Tau 1898, Turanga: Na te Wiremu Hapata i ta ki Te Rau Kahikatea, 1897. Ko Te Paipera Tapu Ara, Ko Te Kawenata Tawhito Me Te Kawenata Hou, He Mea Whakamaori Mai No Nga Reo I Oroko-Tuhituhia Ai, Ranana: He mea ta ki te perehi a W.M.Watts ma te Komiti Ta Paipera mo Ingarangi mo Te Ao Katoa, 1868. Ko Te Pukapuka O Nga Inoinga, Me Era Atu Tikanga, I Whakaritea E Te Hahi O Ingarani, Mo Te Minitatanga O Nga Hakarameta, O Era Atu Ritenga a Te Hahi: Me Nga Waiata Ano Hoki a Rawiri, Me Te Tikanga Mo Te Whiriwhiringa, Mo Te Whakaturanga, Me Te Whakatapunga O Nga Pihopa, O Nga Piriti, Me Nga Rikona, Me Nga Himene, Ranana: I taia tenei ki te perehi o te Komiti mo te whakapuaki i to mohiotanga ki a te Karaiti, 1858. Ko Te Pukapuka O Nga Inoinga, Me Era Atu Tikanga, I Whakaritea E Te Hahi O Ingarani, Mo Te Minitatanga O Nga Hakarameta, O Era Atu Ritenga a Te Hahi: Me Nga Waiata Ano Hoki a Rawiri, Me Te Tikanga Mo Te Whiriwhiringa, Mo Te Whakaturanga, Me Te Whakatapunga O Nga Pihopa, O Nga Piriti, Me Nga Rikona. 1883. The Book of Common Prayer, and Administration of the Sacraments, and Other Rites and Ceremonies of the Church, According to the Use of the United Church of England and Ireland: Together with the Proper Lessons for Sundays and Other Holy-Days, and a New Version of the Psalms of David, Oxford: Printed at 134 the University Press, 1852. The Book of Common Prayer and Administration of the Sacraments, and Other Rites and Ceremonies of the Church, According to the Church of England: Together with the Psalter or Psalms of David, Printed as They Are to Be Sung or Said in Churches: And the Form and Manner of Making, Ordaining, and Consecrating of Bishops, Priests, and Deacons, London: G.E. Eyre and W. Spottiswoode, after 1871 but before 1877. Brown, A.N., The Journals of A.N. Brown C.M.S. Missionary Tauranga Covering the Years 1840 to 1842, Tauranga: The Elms Trust, 1990 (Commemorative Edition). ______________, Select Sermons of A.N. Brown, Tauranga: The Elms Trust, 1997. Fitzgerald, Caroline (ed.), Te Wiremu Henry Williams: Early Years in the North, Wellington: Huia Publishers, 2011. The Hawke's Bay Almanac, Napier: James Wood, Hawke's Bay Herald, 1862, 1863, 1867."); 261 262 //maoriTextDetector.getAllSentencesInMaori(); 504 263 505 264
Note:
See TracChangeset
for help on using the changeset viewer.