Changeset 33583
- Timestamp: 2019-10-18T21:20:18+13:00
- File: 1 edited
gs3-extensions/maori-lang-detection/src/org/greenstone/atea/MaoriTextDetector.java
r33577 r33583 23 23 import java.io.*; 24 24 import opennlp.tools.langdetect.*; 25 import opennlp.tools.sentdetect.*; 25 26 import opennlp.tools.util.*; 27 28 import java.util.ArrayList; 26 29 27 30 /** … … 58 61 public final boolean silentMode; 59 62 63 private final String OPENNLP_MODELS_RELATIVE_PATH = "models" + File.separator; 64 60 65 /** Language Detection Model file for OpenNLP is expected to be at $OPENNLP_HOME/models/langdetect-183.bin */ 61 private final String LANG_DETECT_MODEL_RELATIVE_PATH = "models" + File.separator + "langdetect-183.bin"; 62 66 private final String LANG_DETECT_MODEL_RELATIVE_PATH = OPENNLP_MODELS_RELATIVE_PATH + "langdetect-183.bin"; 67 68 /** Two Māori language sentences taken from http://anglicanhistory.org/england/swilberforce/agathos1882.html 69 * which have a reasonable/high confidence in detection. 70 * We'll use this String of 2 high confidence MRI sentences to detect whether the addition 71 * of a subsequent sentence of unknown language brings down the cumulative confidence level 72 * drastically (below DEF MIN CONF), implying that the added sentence is therefore not likely 73 * to be in MRI. 74 */ 75 private final String TWO_HIGH_CONFIDENCE_MRI_SENTENCES = "Hohoro tonu te whakaae me te haere katoa o nga hoia, kit e wahi e whakamatea nei e te Tarakona. E hou ana to ratou taenga atu, mataara tonu ratou, a kihai I mahue o ratou kahu arai; ka moe etahi ka ara etahi kit e whanga; ano to hunga e tu ana, rawe rawa I te kanapa o a ratou kahu arai, me o ratou ringaringa; hari tonu te tangata kainga no te mea kei waenga pu I a ratou nga hoia o te Kingi e whanga ana."; // 0.9497468988295584 76 77 //"Nan, kia mataara koutou.
Ki te mahara koutou ki aku kupu, a ka karanga mai ki taku ingoa, I nga wa, e tata mai ai te mate; a kit e mau tonu ano hoki koutou ki te kahu aria katoa, me nga ringaringa, kua oti nei te taka e ahau mo koutou, e kore koutou e mate I te Tarakona."; // 0.7220962333610585 78 79 //"E rapu ana ia i te hoari a tona Piriniha, a te kitea ki tana taha e mau ana, ka ngore noa nga turi, e haerea atu ana e te Tarakona, e karanga ana ia ki tona Kingi, otira e mea ake ana a roto i a ia, kua pahure ke te ra e karanga atu ai ia; kua whakarere hoki ia i ana kahu arai me ona ringaringa, a kahore he mea hei whakakora i a ia, tahuri noa atu ia ki te oma, hoake rawa kua kapi mai a mua ona, i nga tao o te Tarakona, a na te mea kua mahue i a ia nga takai paraihi, ngore noa ona waewae, kainga ana ia e te Tarakona. I peneitia ano hoki etahi, a ngaro noa iho ratou i te tirohanga a o ratou [6/7] hoa; ka puta te mahara ki o ratou hoa, ka pouri o ratou ngakau, otira kihai roa, kua hakari ano ratou, kua inu, kua hari, kua whakarere i o ratou kahu arai, kua wareware ano hoki ki nga kupu a to ratou Piriniha, te mahara kua patata te mate."; //0.991402350887951. 80 81 /** http://www.greenstone.org/ */ 82 private final String TWO_HIGH_CONFIDENCE_EN_SENTENCES="We hope that this software will encourage the effective deployment of digital libraries to share information and place it in the public domain. Further information can be found in the book How to build a digital library, authored by three of the group's members."; 83 84 85 /** 86 * Large chunk of text in te reo Māori from 87 * http://anglicanhistory.org/england/swilberforce/agathos1882.html 88 * for testing the language detector 89 */ 90 private final static String MRI_SENTENCE_TEST="Meake ratou haere, ka mea atu ia ki a ratou, \"Nana, e matau ana koutou ahakoa whakaputaina mai te riri me te kaha katoa o te Tarakona ki ahau I mua ra, kihai ia I kaha, a mate ana I ahau.
Me aru katoa aku hoa pono, I te ritenga kua waiho iho e ahau ki a ratou kia maia ratou, me ahau kua maia; ko reira ratou noho ai I raro-raro iho I toku torona. Pinky was here today! Na koneil I tonoa atu ai koutou e ahau, kit e whawhai ki tenei Tarakona, a k otaku kaha e haere tahi atu me koutou ki te taua. Nan, kia mataara koutou. Ki te mahara koutou ki aku kupu, a ka karanga mai ki taku ingoa, I nga wa, e tata mai ai te mate; a kit e mau tonu ano hoki koutou ki te kahu aria katoa, me nga ringaringa, kua oti nei te taka e ahau mo koutou, e kore koutou e mate I te Tarakona. Otira kit e rokohina whakaarokoretia mai koutou e ia, a kahore o koutou kahu aria e mau ana, ka mate koutou I a ia.\" [4] Hohoro tonu te whakaae me te haere katoa o nga hoia, kit e wahi e whakamatea nei e te Tarakona. E hou ana to ratou taenga atu, mataara tonu ratou, a kihai I mahue o ratou kahu arai; ka moe etahi ka ara etahi kit e whanga; ano to hunga e tu ana, rawe rawa I te kanapa o a ratou kahu arai, me o ratou ringaringa; hari tonu te tangata kainga no te mea kei waenga pu I a ratou nga hoia o te Kingi e whanga ana. Ma te ata e titiro ko nga kai mataara o te po ka kaere kit e moe, a ko te hunga kua moe, oti rawa te whakakahu ki o ratou, kahu arai, me to whakam[a]tautau hoki i te koinga o nga hoari, ka karanga ratou ki te ingoa o to ratou Piriniha, a ka haere ki te whanga i te Tarakona kino. Rawe rawa ratou i konei, otira kihai ratou i mau tonu ki tenei ritenga; tiaki noa hoki ratou, a te puta te Tarakona. Marire tonu to ratou kainga. Ngaki ana te tangata whenua i a ratou mara, a ka haere ano hoki ka tata ke hauhakenga, e marena ana ratou, e tuku hakari ana, e hokohoko ana; a ka whakaaro nga hoia he teka noa pea nga rongo o te Tarakona, ka wareware haere ki te kupu o to ratou Piriniha mo te mataara, me te tupato. Na te kaha o te ra ka taimaha o ratou ringaringa; mea noa tetahi \"Ha, he aha te tikanga i maua tonutia ai tenei potae taimaha? 
Wera noa iho taku matenga i te whitinga iho o te ra ki tenei potae, a te kitea te Tarakona e meingatia nei, ka mahue [4/5] rawa i ahahu te potae nei ki te teneti, hei te kiteatanga at u o te Tarakona e haere mai ana, ka tiki ai. Pera noa hoki tetahik ki te arai o tona uma, me tetahi hoki ki tona arai. A na te wera o te whenua ka wera ake nga takai paraihi o o ratou waewae; mamae noa ratou, a mahue iho era, a ka marara ratou, puta noa ki tenei hakari ki tera marengatanga ranei. Kihai i matauria he hoia ratou no te Kingi, ma te rapu tonu ano ia ki tana tohu e mau ana, ka matauria ai, mahue rawa hoki te ahua i tonoa mai ai ratou e to ratou Piriniha ke te taua. Kotahi ia o ratou kihai rite ki ana hoa, ko Akatohe te ingoa, pouri raw tona ngakau ki a ratou mahi. He tini ana whakamaharatanga at ki a ratou i nga kupu a to ratou Piriniha, mea atu ana ia ki a ratou, \"Ahakoa te kitea, e koro ma, te hoa riri, tenei ano ia te patata ana; a kahore he pohehetanga o to tatou Piriniha, kua whawhai hoki ia ki te Tarakona, a kua matau ia ki tana ahua whakamataku.\" Kataina ana, tawaia ana tenei tangata maia, meinga ana ia he wawau, no te rite ana mahi ki a ratou. Otiia kiahi ia i whakarongo; a ahakoa puta o ratou kupo kino, ahakoa kah te ra o te awatea hei whakahemo i a ia, ahahkoa negenge ia i tona haerenga i te weranga o te onepu, ahakoa kuiki ia i nga huarahi o te po, kihai i mahue i a Akatohe nga kahu arai a tona Piriniha, i hoatu ai kia mau tonu i a ia; kihai hoki i mahue i a ia nga takai paraihi o ana waewae mamae, me tana mahi mataara i te po. [5/6] Roa rawa iho to ratou noho penei, a te kitea mai te hoa riri, ka kake haee o ratou kupu kino ki a ia. A mea kau ano ratou, \"he ora, kahore, kua patata te mate.\" Katahi hoki ka kitea nga tohu whakamatau, me he ai tangta hei titiro. 
Tera taua hoia i tenei wa e hoki mai ana i te hakari, kua hari, kua waiata, kua kanikani ratou, a kua mahue i taua hoia ana kahu arai me ana ringaringa; a tenei ia te hoki marire ana ki tana teneti i te ahi-ahi o te rangi raumati. E whakaaro haere ana ia ki ana hoa i taua hakari, ki te rawe ano hoki ona, e mhh an ki a Akatohe mona e wehi nei, e haereere tonu nei i te roro o tona teneti e pehia ana e te taimaha o ana kahu arai. E whakaaroa ana ano enei mea, ka rongo ia ki te ngaehe e puta mai ana i te motu ngahere, ki matau ona, a me te uira ano te puta whakarere mai o te Tarakona ki mua ona. E rapu ana ia i te hoari a tona Piriniha, a te kitea ki tana taha e mau ana, ka ngore noa nga turi, e haerea atu ana e te Tarakona, e karanga ana ia ki tona Kingi, otira e mea ake ana a roto i a ia, kua pahure ke te ra e karanga atu ai ia; kua whakarere hoki ia i ana kahu arai me ona ringaringa, a kahore he mea hei whakakora i a ia, tahuri noa atu ia ki te oma, hoake rawa kua kapi mai a mua ona, i nga tao o te Tarakona, a na te mea kua mahue i a ia nga takai paraihi, ngore noa ona waewae, kainga ana ia e te Tarakona. I peneitia ano hoki etahi, a ngaro noa iho ratou i te tirohanga a o ratou [6/7] hoa; ka puta te mahara ki o ratou hoa, ka pouri o ratou ngakau, otira kihai roa, kua hakari ano ratou, kua inu, kua hari, kua whakarere i o ratou kahu arai, kua wareware ano hoki ki nga kupu a to ratou Piriniha, te mahara kua patata te mate. Tera te Tarakona kua manamanangia i te matenga o etahi o nga hoia i a ia, ka whakaaroaro kia huakina putia nga toenga iho, kia kotahi ai matenga o ana hoa riri."; 91 63 92 /** 64 93 * The LanguageDetectorModel object that will do the actual language detection/prediction for us. … … 66 95 */ 67 96 private LanguageDetector myCategorizer = null; 68 97 98 /** 99 * The Sentence Detection object that does the sentence splitting for the language 100 * the sentece model was trained for. 
101 */ 102 private SentenceDetectorME sentenceDetector = null; 103 69 104 /** String taken from our university website, https://www.waikato.ac.nz/maori/ */ 70 105 public static final String TEST_MRI_INPUT_TEXT = "Ko tēnei te Whare Wānanga o Waikato e whakatau nei i ngā iwi o te ao, ki roto i te riu o te awa e rere nei, ki runga i te whenua e hora nei, ki raro i te taumaru o ngā maunga whakaruru e tau awhi nei."; … … 77 112 this(silentMode, DEFAULT_MINIMUM_CONFIDENCE); 78 113 } 79 114 115 /** Constructor that uses the sentence Model we trained for Māori */ 80 116 public MaoriTextDetector(boolean silentMode, double min_confidence) throws Exception { 117 this(silentMode, min_confidence, "mri-sent_trained.bin"); 118 } 119 120 /** More general constructor that can use sentence detector models for other languages */ 121 public MaoriTextDetector(boolean silentMode, double min_confidence, 122 String sentenceModelFileName) throws Exception 123 { 81 124 this.silentMode = silentMode; 82 125 this.MINIMUM_CONFIDENCE = min_confidence; … … 91 134 if(!langDetectModelBinFile.exists()) { 92 135 throw new Exception("\n\t*** " + langDetectModelBinFile.getPath() + " doesn't exist." 93 + "\n\t*** Ensure the $OPENNLP_HOME folder contains a 'models' folder with the model file 'langdetect-183.bin' in it."); 136 + "\n\t*** Ensure the $OPENNLP_HOME folder contains a 'models' folder" 137 + "\n\t*** with the model file 'langdetect-183.bin' in it."); 94 138 } 95 139 … … 109 153 110 154 // instantiating function should handle critical exceptions. Constructors shouldn't. 111 } 112 155 156 157 158 // 3.
Set up our sentence model and SentenceDetector object 159 String sentenceModelPath = System.getenv("OPENNLP_HOME") + File.separator 160 + OPENNLP_MODELS_RELATIVE_PATH + sentenceModelFileName; // "mri-sent_trained.bin" default 161 File sentenceModelBinFile = new File(sentenceModelPath); 162 if(!sentenceModelBinFile.exists()) { 163 throw new Exception("\n\t*** " + sentenceModelBinFile.getPath() + " doesn't exist." 164 + "\n\t*** Ensure the $OPENNLP_HOME folder contains a 'models' folder" 165 + "\n\t*** with the model file "+sentenceModelFileName+" in it."); 166 } 167 try (InputStream modelIn = new FileInputStream(sentenceModelPath)) { 168 // https://www.tutorialspoint.com/opennlp/opennlp_sentence_detection.htm 169 SentenceModel sentenceModel = new SentenceModel(modelIn); 170 this.sentenceDetector = new SentenceDetectorME(sentenceModel); 171 172 } // instantiating function should handle this critical exception 173 } 174 175 public ArrayList<String> getAllSentencesInMaori(String text) throws Exception { 176 // big assumption here: that we can split incoming text into sentences 177 // for any language (using the Māori language trained sentence model), 178 // despite not knowing what language those sentences are in 179 // Hinges on MRI sentences detection being similar to at least ENG equivalent 180 181 182 // we'll be storing just those sentences in text that are in Māori. 183 ArrayList<String> mriSentences = new ArrayList<String>(); 184 // OpenNLP language detection works best with a minimum of 2 sentences 185 // See https://opennlp.apache.org/news/model-langdetect-183.html 186 // "It is important to note that this model is trained for and works well with 187 // longer texts that have at least 2 sentences or more from the same language."
188 // So we'll be attempting to detect the language working on 2 sentences at a time 189 190 String[] sentences = sentenceDetector.sentDetect(text); 191 double prev_confidence = 0.0; 192 193 for(int i = 1; i < sentences.length; i++) { 194 String two_sentences = sentences[i-1]+" "+sentences[i]+" This is another sentence."; 195 //for(int i = 0; i < sentences.length; i++) { 196 //String two_sentences = sentences[i]; 197 198 199 System.err.println(two_sentences); 200 201 //isTextInMaori(two_sentences) 202 Language bestLanguage = myCategorizer.predictLanguage(two_sentences); 203 if(bestLanguage.getLang().equals(MAORI_3LETTER_CODE)) { 204 double confidence = bestLanguage.getConfidence(); 205 /* 206 if(prev_confidence >= this.MINIMUM_CONFIDENCE) { 207 if(confidence < prev_confidence) { 208 // then the current sentence dragged down confidence 209 // and we're only confident about previous sentence 210 mriSentences.add(sentences[i-1]); 211 } else { 212 213 mriSentences.add(sentences[i]); 214 } 215 } 216 prev_confidence = confidence; 217 */ 218 System.err.println("Confidence for sentences up to " + i + ": " + confidence); 219 System.err.println(""); 220 } 221 222 /* 223 two_sentences = sentences[i] + " Pinky was here today."; 224 bestLanguage = myCategorizer.predictLanguage(two_sentences); 225 double confidence = bestLanguage.getConfidence(); 226 System.err.println("Confidence for added Pinky: " + confidence); 227 System.err.println(""); 228 */ 229 } 230 return mriSentences; 231 } 232 233 public ArrayList<String> getAllSentencesInLanguage(String langCode, String text) throws Exception { 234 // big assumption here: that we can split incoming text into sentences 235 // for any language (using the Māori language trained sentence model), 236 // despite not knowing what language those sentences are in 237 // Hinges on MRI sentences detection being similar to at least ENG equivalent 238 239 240 // we'll be storing just those sentences in text that are in Māori.
241 ArrayList<String> mriSentences = new ArrayList<String>(); 242 // OpenNLP language detection works best with a minimum of 2 sentences 243 // See https://opennlp.apache.org/news/model-langdetect-183.html 244 // "It is important to note that this model is trained for and works well with 245 // longer texts that have at least 2 sentences or more from the same language." 246 247 248 // we're pretty confident that the following static string is in Māori 249 // but want to store its confidence level as baseline confidence value 250 // to compare other sentences against 251 252 String baseline = TWO_HIGH_CONFIDENCE_MRI_SENTENCES; 253 254 Language bestLanguage = myCategorizer.predictLanguage(baseline); 255 if(!bestLanguage.getLang().equals(langCode)) { 256 System.err.println("@@@@ Something's gone wrong, obvious "+MAORI_3LETTER_CODE+" language string not properly detected as "+MAORI_3LETTER_CODE+" any more."); 257 } 258 double baselineConfidence = bestLanguage.getConfidence(); 259 System.err.println("Baseline confidence: " + baselineConfidence); 260 System.err.println("----------------------------------------"); 261 262 String[] sentences = sentenceDetector.sentDetect(text); 263 264 for(int i = 0; i < sentences.length; i++) { 265 String unknownLangSentenceAppendedToBaseline = baseline+" "+sentences[i]; 266 267 System.err.println("Added sentence: " + sentences[i]); 268 269 bestLanguage = myCategorizer.predictLanguage(unknownLangSentenceAppendedToBaseline); 270 double confidence = bestLanguage.getConfidence(); 271 //System.err.println("Confidence is now " + confidence); 272 273 //if(!bestLanguage.getLang().equals(langCode) || confidence < this.MINIMUM_CONFIDENCE) { 274 275 // confidence should increase with added sentence in same language (or should 276 // stay about the same?)
not decrease with added sentence in same language 277 if(bestLanguage.getLang().equals(langCode) && confidence > baselineConfidence) { 278 279 System.err.println("Added sentence (maintained or) increased confidence to: " + confidence); 280 281 282 } 283 else { 284 System.err.println("ADDED sentence not in " + langCode + " as it DECREASED confidence to: " + confidence); 285 } 286 System.err.println(""); 287 288 } 289 return mriSentences; 290 } 291 292 113 293 /** 114 294 * @return true if the input text is Maori (mri) with MINIMUM_CONFIDENCE levels of confidence (if set, … … 388 568 } 389 569 390 570 391 571 // 2. Finally, we can now do the actual language detection 392 572 try { … … 397 577 maoriTextDetector = new MaoriTextDetector(runSilent, minConfidence); 398 578 } 579 580 // TODO 581 maoriTextDetector.getAllSentencesInMaori(MRI_SENTENCE_TEST); 582 //maoriTextDetector.getAllSentencesInLanguage(MAORI_3LETTER_CODE, MRI_SENTENCE_TEST); 583 maoriTextDetector.getAllSentencesInLanguage(MAORI_3LETTER_CODE, 584 "Primary sources ~ Published Maramataka Mo Te Tau 1885, Nepia: Te Haaringi, Kai-ta Pukapuka, kei Hehitingi Tiriti, 1884. Maramataka Mo Te Tau 1886, Nepia: Na te Haaringi i ta ki tona Whare Perehi Pukapuka, 1885. Maramataka Mo Te Tau 1887, Nepia: Na te Haaringi i ta ki tona Whare Perehi Pukapuka, 1886. Maramataka Mo Te Tau 1888, Nepia: Na te Haaringi i ta ki tona Whare Perehi Pukapuka, 1887. Maramataka Mo Te Tau 1889, Nepia: Na te Haaringi i ta ki tona Whare Perehi Pukapuka, 1888. Maramataka Mo Te Tau 1890, Nepia: Na te Haaringi i ta ki tona Whare Perehi Pukapuka, 1889. Maramataka Mo Te Tau 1891, Kihipane: Na te Muri i ta ki tona Whare Perehi Pukapuka, 1890. Maramataka Mo Te Tau 1892, Nepia: Na te Haaringi, i ta ki tona Whare Perehi Pukapuka, 1891. Maramataka Mo Te Tau 1893, Kihipane: Na te Muri i ta ki tona Whare Perehi Pukapuka, 1892. Maramataka Mo Te Tau 1894, Kihipane: Na te Muri i ta ki tona Whare Perehi Pukapuka, 1893. 
Maramataka Me Te Tau 1895, Kihipane: Na te Muri i Ta ki tona whare perehi pukapuka, 1894. Maramataka Mo Te Tau 1896, Kihipane: Na te Muri i ta ki tona Whare Perehi Pukapuka, 1895. Maramataka Mo Te Tau 1897, Kihipane: Na te Muri i ta ki tona Whare Perehi Pukapuka 1896. Maramataka Mo Te Tau 1898, Turanga: Na te Wiremu Hapata i ta ki Te Rau Kahikatea, 1897. Ko Te Paipera Tapu Ara, Ko Te Kawenata Tawhito Me Te Kawenata Hou, He Mea Whakamaori Mai No Nga Reo I Oroko-Tuhituhia Ai, Ranana: He mea ta ki te perehi a W.M.Watts ma te Komiti Ta Paipera mo Ingarangi mo Te Ao Katoa, 1868. Ko Te Pukapuka O Nga Inoinga, Me Era Atu Tikanga, I Whakaritea E Te Hahi O Ingarani, Mo Te Minitatanga O Nga Hakarameta, O Era Atu Ritenga a Te Hahi: Me Nga Waiata Ano Hoki a Rawiri, Me Te Tikanga Mo Te Whiriwhiringa, Mo Te Whakaturanga, Me Te Whakatapunga O Nga Pihopa, O Nga Piriti, Me Nga Rikona, Me Nga Himene, Ranana: I taia tenei ki te perehi o te Komiti mo te whakapuaki i to mohiotanga ki a te Karaiti, 1858. Ko Te Pukapuka O Nga Inoinga, Me Era Atu Tikanga, I Whakaritea E Te Hahi O Ingarani, Mo Te Minitatanga O Nga Hakarameta, O Era Atu Ritenga a Te Hahi: Me Nga Waiata Ano Hoki a Rawiri, Me Te Tikanga Mo Te Whiriwhiringa, Mo Te Whakaturanga, Me Te Whakatapunga O Nga Pihopa, O Nga Piriti, Me Nga Rikona. 1883. The Book of Common Prayer, and Administration of the Sacraments, and Other Rites and Ceremonies of the Church, According to the Use of the United Church of England and Ireland: Together with the Proper Lessons for Sundays and Other Holy-Days, and a New Version of the Psalms of David, Oxford: Printed at 134 the University Press, 1852. The Book of Common Prayer and Administration of the Sacraments, and Other Rites and Ceremonies of the Church, According to the Church of England: Together with the Psalter or Psalms of David, Printed as They Are to Be Sung or Said in Churches: And the Form and Manner of Making, Ordaining, and Consecrating of Bishops, Priests, and Deacons, London: G.E. 
Eyre and W. Spottiswoode, after 1871 but before 1877. Brown, A.N., The Journals of A.N. Brown C.M.S. Missionary Tauranga Covering the Years 1840 to 1842, Tauranga: The Elms Trust, 1990 (Commemorative Edition). ______________, Select Sermons of A.N. Brown, Tauranga: The Elms Trust, 1997. Fitzgerald, Caroline (ed.), Te Wiremu Henry Williams: Early Years in the North, Wellington: Huia Publishers, 2011. The Hawke's Bay Almanac, Napier: James Wood, Hawke's Bay Herald, 1862, 1863, 1867."); 585 399 586 400 587 //boolean textIsInMaori = maoriTextDetector.isTextInMaori(TEST_MRI_INPUT_TEXT); // test hardcoded string
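
In this revision both getAllSentencesInMaori() and getAllSentencesInLanguage() only log confidence values and return an empty list, since nothing is ever added to mriSentences outside commented-out code. The following is a minimal sketch of how the baseline comparison in getAllSentencesInLanguage() might collect matching sentences. It is not the changeset's code: the class BaselineSentenceFilter, the Prediction type, and the detector function are hypothetical stand-ins (Prediction mirrors the language code and confidence read from OpenNLP's Language object), and the use of >= rather than the changeset's strict > is an assumption based on the "(maintained or) increased" log message.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Sketch only: baseline-comparison filter, decoupled from OpenNLP so it runs standalone. */
final class BaselineSentenceFilter {

    /** Hypothetical stand-in for a detection result: a language code plus a confidence. */
    static final class Prediction {
        final String lang;
        final double confidence;

        Prediction(String lang, double confidence) {
            this.lang = lang;
            this.confidence = confidence;
        }
    }

    /**
     * Keeps each sentence that, when appended to a high-confidence baseline text,
     * is still detected as langCode without lowering the baseline confidence.
     */
    static List<String> sentencesInLanguage(String langCode, String baseline,
            String[] sentences, Function<String, Prediction> detector) {
        double baselineConfidence = detector.apply(baseline).confidence;
        List<String> kept = new ArrayList<>();
        for (String sentence : sentences) {
            Prediction p = detector.apply(baseline + " " + sentence);
            // Assumption: maintained-or-increased confidence counts as a match (>=);
            // the changeset itself tests with a strict >.
            if (p.lang.equals(langCode) && p.confidence >= baselineConfidence) {
                kept.add(sentence);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Toy detector for demonstration only; a real caller would wrap predictLanguage().
        Function<String, Prediction> toyDetector =
            text -> new Prediction("mri", text.contains("English") ? 0.50 : 0.95);
        String[] sentences = { "Tena koe.", "This is English." };
        System.out.println(sentencesInLanguage("mri", "Kia ora koutou.", sentences, toyDetector));
    }
}
```

Passing the detector in as a function keeps the filtering logic testable without the model files under $OPENNLP_HOME. Whether a single appended sentence moves the real model's confidence enough to be a reliable signal is exactly what the logging in this changeset is probing, so the threshold rule above should be read as provisional.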