Changeset 33585 for gs3-extensions/maori-lang-detection/src/org/greenstone/atea/MaoriTextDetector.java
Timestamp: 2019-10-18T21:41:32+13:00
File: 1 edited
gs3-extensions/maori-lang-detection/src/org/greenstone/atea/MaoriTextDetector.java
--- r33584
+++ r33585
@@ -66,28 +66,4 @@
     private final String LANG_DETECT_MODEL_RELATIVE_PATH = OPENNLP_MODELS_RELATIVE_PATH + "langdetect-183.bin";
 
-    /** Two Māori language sentences taken from http://anglicanhistory.org/england/swilberforce/agathos1882.html
-     * which have a reasonable/high confidence in detection.
-     * We'll use this String of 2 high confidence MRI sentences to detect whether the addition
-     * of a subsequent sentence of unknown language brings down the cumulative confidence level
-     * drastically (below DEF MIN CONF), implying that the added sentence is therefore not likely
-     * to be in MRI.
-     */
-    private final static String TWO_HIGH_CONFIDENCE_MRI_SENTENCES = "Hohoro tonu te whakaae me te haere katoa o nga hoia, kit e wahi e whakamatea nei e te Tarakona. E hou ana to ratou taenga atu, mataara tonu ratou, a kihai I mahue o ratou kahu arai; ka moe etahi ka ara etahi kit e whanga; ano to hunga e tu ana, rawe rawa I te kanapa o a ratou kahu arai, me o ratou ringaringa; hari tonu te tangata kainga no te mea kei waenga pu I a ratou nga hoia o te Kingi e whanga ana."; // 0.9497468988295584
-
-    //"Nan, kia mataara koutou. Ki te mahara koutou ki aku kupu, a ka karanga mai ki taku ingoa, I nga wa, e tata mai ai te mate; a kit e mau tonu ano hoki koutou ki te kahu aria katoa, me nga ringaringa, kua oti nei te taka e ahau mo koutou, e kore koutou e mate I te Tarakona."; // 0.7220962333610585
-
-    //"E rapu ana ia i te hoari a tona Piriniha, a te kitea ki tana taha e mau ana, ka ngore noa nga turi, e haerea atu ana e te Tarakona, e karanga ana ia ki tona Kingi, otira e mea ake ana a roto i a ia, kua pahure ke te ra e karanga atu ai ia; kua whakarere hoki ia i ana kahu arai me ona ringaringa, a kahore he mea hei whakakora i a ia, tahuri noa atu ia ki te oma, hoake rawa kua kapi mai a mua ona, i nga tao o te Tarakona, a na te mea kua mahue i a ia nga takai paraihi, ngore noa ona waewae, kainga ana ia e te Tarakona. I peneitia ano hoki etahi, a ngaro noa iho ratou i te tirohanga a o ratou [6/7] hoa; ka puta te mahara ki o ratou hoa, ka pouri o ratou ngakau, otira kihai roa, kua hakari ano ratou, kua inu, kua hari, kua whakarere i o ratou kahu arai, kua wareware ano hoki ki nga kupu a to ratou Piriniha, te mahara kua patata te mate."; //0.991402350887951.
-
-    /** English language sentences from http://www.greenstone.org/ */
-    private final static String TWO_HIGH_CONFIDENCE_EN_SENTENCES="We hope that this software will encourage the effective deployment of digital libraries to share information and place it in the public domain. Further information can be found in the book How to build a digital library, authored by three of the group's members.";
-
-
-    /**
-     * Large chunk of text in te reo Māori from
-     * http://anglicanhistory.org/england/swilberforce/agathos1882.html
-     * for testing the language detector
-     */
-    private final static String MRI_SENTENCE_TEST="Meake ratou haere, ka mea atu ia ki a ratou, \"Nana, e matau ana koutou ahakoa whakaputaina mai te riri me te kaha katoa o te Tarakona ki ahau I mua ra, kihai ia I kaha, a mate ana I ahau. Me aru katoa aku hoa pono, I te ritenga kua waiho iho e ahau ki a ratou kia maia ratou, me ahau kua maia; ko reira ratou noho ai I raro-raro iho I toku torona. Pinky was here today! Na koneil I tonoa atu ai koutou e ahau, kit e whawhai ki tenei Tarakona, a k otaku kaha e haere tahi atu me koutou ki te taua. Nan, kia mataara koutou. Ki te mahara koutou ki aku kupu, a ka karanga mai ki taku ingoa, I nga wa, e tata mai ai te mate; a kit e mau tonu ano hoki koutou ki te kahu aria katoa, me nga ringaringa, kua oti nei te taka e ahau mo koutou, e kore koutou e mate I te Tarakona. Otira kit e rokohina whakaarokoretia mai koutou e ia, a kahore o koutou kahu aria e mau ana, ka mate koutou I a ia.\" [4] Hohoro tonu te whakaae me te haere katoa o nga hoia, kit e wahi e whakamatea nei e te Tarakona. E hou ana to ratou taenga atu, mataara tonu ratou, a kihai I mahue o ratou kahu arai; ka moe etahi ka ara etahi kit e whanga; ano to hunga e tu ana, rawe rawa I te kanapa o a ratou kahu arai, me o ratou ringaringa; hari tonu te tangata kainga no te mea kei waenga pu I a ratou nga hoia o te Kingi e whanga ana. Ma te ata e titiro ko nga kai mataara o te po ka kaere kit e moe, a ko te hunga kua moe, oti rawa te whakakahu ki o ratou, kahu arai, me to whakam[a]tautau hoki i te koinga o nga hoari, ka karanga ratou ki te ingoa o to ratou Piriniha, a ka haere ki te whanga i te Tarakona kino. Rawe rawa ratou i konei, otira kihai ratou i mau tonu ki tenei ritenga; tiaki noa hoki ratou, a te puta te Tarakona. Marire tonu to ratou kainga. Ngaki ana te tangata whenua i a ratou mara, a ka haere ano hoki ka tata ke hauhakenga, e marena ana ratou, e tuku hakari ana, e hokohoko ana; a ka whakaaro nga hoia he teka noa pea nga rongo o te Tarakona, ka wareware haere ki te kupu o to ratou Piriniha mo te mataara, me te tupato. Na te kaha o te ra ka taimaha o ratou ringaringa; mea noa tetahi \"Ha, he aha te tikanga i maua tonutia ai tenei potae taimaha? Wera noa iho taku matenga i te whitinga iho o te ra ki tenei potae, a te kitea te Tarakona e meingatia nei, ka mahue [4/5] rawa i ahahu te potae nei ki te teneti, hei te kiteatanga at u o te Tarakona e haere mai ana, ka tiki ai. Pera noa hoki tetahik ki te arai o tona uma, me tetahi hoki ki tona arai. A na te wera o te whenua ka wera ake nga takai paraihi o o ratou waewae; mamae noa ratou, a mahue iho era, a ka marara ratou, puta noa ki tenei hakari ki tera marengatanga ranei. Kihai i matauria he hoia ratou no te Kingi, ma te rapu tonu ano ia ki tana tohu e mau ana, ka matauria ai, mahue rawa hoki te ahua i tonoa mai ai ratou e to ratou Piriniha ke te taua. Kotahi ia o ratou kihai rite ki ana hoa, ko Akatohe te ingoa, pouri raw tona ngakau ki a ratou mahi. He tini ana whakamaharatanga at ki a ratou i nga kupu a to ratou Piriniha, mea atu ana ia ki a ratou, \"Ahakoa te kitea, e koro ma, te hoa riri, tenei ano ia te patata ana; a kahore he pohehetanga o to tatou Piriniha, kua whawhai hoki ia ki te Tarakona, a kua matau ia ki tana ahua whakamataku.\" Kataina ana, tawaia ana tenei tangata maia, meinga ana ia he wawau, no te rite ana mahi ki a ratou. Otiia kiahi ia i whakarongo; a ahakoa puta o ratou kupo kino, ahakoa kah te ra o te awatea hei whakahemo i a ia, ahahkoa negenge ia i tona haerenga i te weranga o te onepu, ahakoa kuiki ia i nga huarahi o te po, kihai i mahue i a Akatohe nga kahu arai a tona Piriniha, i hoatu ai kia mau tonu i a ia; kihai hoki i mahue i a ia nga takai paraihi o ana waewae mamae, me tana mahi mataara i te po. [5/6] Roa rawa iho to ratou noho penei, a te kitea mai te hoa riri, ka kake haee o ratou kupu kino ki a ia. A mea kau ano ratou, \"he ora, kahore, kua patata te mate.\" Katahi hoki ka kitea nga tohu whakamatau, me he ai tangta hei titiro. Tera taua hoia i tenei wa e hoki mai ana i te hakari, kua hari, kua waiata, kua kanikani ratou, a kua mahue i taua hoia ana kahu arai me ana ringaringa; a tenei ia te hoki marire ana ki tana teneti i te ahi-ahi o te rangi raumati. E whakaaro haere ana ia ki ana hoa i taua hakari, ki te rawe ano hoki ona, e mhh an ki a Akatohe mona e wehi nei, e haereere tonu nei i te roro o tona teneti e pehia ana e te taimaha o ana kahu arai. E whakaaroa ana ano enei mea, ka rongo ia ki te ngaehe e puta mai ana i te motu ngahere, ki matau ona, a me te uira ano te puta whakarere mai o te Tarakona ki mua ona. E rapu ana ia i te hoari a tona Piriniha, a te kitea ki tana taha e mau ana, ka ngore noa nga turi, e haerea atu ana e te Tarakona, e karanga ana ia ki tona Kingi, otira e mea ake ana a roto i a ia, kua pahure ke te ra e karanga atu ai ia; kua whakarere hoki ia i ana kahu arai me ona ringaringa, a kahore he mea hei whakakora i a ia, tahuri noa atu ia ki te oma, hoake rawa kua kapi mai a mua ona, i nga tao o te Tarakona, a na te mea kua mahue i a ia nga takai paraihi, ngore noa ona waewae, kainga ana ia e te Tarakona. I peneitia ano hoki etahi, a ngaro noa iho ratou i te tirohanga a o ratou [6/7] hoa; ka puta te mahara ki o ratou hoa, ka pouri o ratou ngakau, otira kihai roa, kua hakari ano ratou, kua inu, kua hari, kua whakarere i o ratou kahu arai, kua wareware ano hoki ki nga kupu a to ratou Piriniha, te mahara kua patata te mate. Tera te Tarakona kua manamanangia i te matenga o etahi o nga hoia i a ia, ka whakaaroaro kia huakina putia nga toenga iho, kia kotahi ai matenga o ana hoa riri.";
-
     /**
      * The LanguageDetectorModel object that will do the actual language detection/prediction for us.
@@ -173,12 +149,13 @@
     }
 
-    public ArrayList<String> old_getAllSentencesInMaori(String text) throws Exception {
-        // big assumption here: that we can split incoming text into sentences
-        // for any language (using the Māori language trained sentence model),
-        // despite not knowing what language those sentences are in
-        // Hinges on MRI sentences detection being similar to at least ENG equivalent
-
-
-        // we'll be storing just those sentences in text that are in Māori.
+    /**
+     * In this class' constructor, need to have set up the Sentence Detection Model
+     * for the langCode passed in to this function in order for the output to make
+     * sense for that language.
+     */
+    public ArrayList<String> getAllSentencesInLanguage(String langCode, String text, double confidenceCutoff)
+    {
+
+        // we'll be storing just those sentences in text that are in the denoted language code
         ArrayList<String> mriSentences = new ArrayList<String>();
         // OpenNLP language detection works best with a minimum of 2 sentences
@@ -186,48 +163,25 @@
         // "It is important to note that this model is trained for and works well with
         // longer texts that have at least 2 sentences or more from the same language."
-        // So we'll be attempting to detect the language working on 2 sentences at a time
+
+        // For evaluating single languages, I used a very small data set and found that
+        // if the primary language detected is MRI AND if the confidence is >= 0.1, the
+        // results appear reasonably to be in te reo Māori.
 
         String[] sentences = sentenceDetector.sentDetect(text);
-        double prev_confidence = 0.0;
-
-        //for(int i = 1; i < sentences.length; i++) {
-        //String two_sentences = sentences[i-1]+" "+sentences[i]+" This is another sentence.";
+
         for(int i = 0; i < sentences.length; i++) {
-            String two_sentences = sentences[i];
-
-            System.err.println(two_sentences);
-
-            //isTextInMaori(two_sentences)
-            Language bestLanguage = myCategorizer.predictLanguage(two_sentences);
+            String sentence = sentences[i];
+
+            //System.err.println(sentence);
+
+            Language bestLanguage = myCategorizer.predictLanguage(sentence);
             double confidence = bestLanguage.getConfidence();
-            if(bestLanguage.getLang().equals(MAORI_3LETTER_CODE)
-               && confidence >= 0.1) {
-
-                /*
-                if(prev_confidence >= this.MINIMUM_CONFIDENCE) {
-                    if(confidence < prev_confidence) {
-                        // then the current sentence dragged down confidence
-                        // and we're only confident about previous sentence
-                        mriSentences.add(sentences[i-1]);
-                    } else {
-
-                        mriSentences.add(sentences[i]);
-                    }
-                }
-                prev_confidence = confidence;
-                */
-                System.err.println("Confidence for sentences up to " + i + ": " + confidence);
-                System.err.println("");
+
+            if(bestLanguage.getLang().equals(langCode) && confidence >= confidenceCutoff) {
+                System.err.println("Adding sentence: " + sentence + "\n");
+                mriSentences.add(sentence);
             } else {
-                System.err.println("NOT primary language - confidence: " + confidence);
-            }
-
-            /*
-            two_sentences = sentences[i] + " Pinky was here today.";
-            bestLanguage = myCategorizer.predictLanguage(two_sentences);
-            double confidence = bestLanguage.getConfidence();
-            System.err.println("Confidence for added Pinky: " + confidence);
-            System.err.println("");
-            */
+                System.err.println("SKIPPING sentence: " + sentence + "\n");
+            }
         }
         return mriSentences;
@@ -249,52 +203,10 @@
         // longer texts that have at least 2 sentences or more from the same language."
 
-
-        // we're pretty confident that the following static string is in Māori
-        // but want to store its confidence level as baseline confidence value
-        // to compare other sentences against
-
-        String baseline = TWO_HIGH_CONFIDENCE_MRI_SENTENCES;
-        return getAllSentencesInLanguage(MAORI_3LETTER_CODE, baseline, text);
-    }
-
-
-    public ArrayList<String> getAllSentencesInLanguage(String langCode, String baseline, String text) throws Exception {
-        // we'll be storing just those sentences that are in the requested language code
-        ArrayList<String> mriSentences = new ArrayList<String>();
-
-        Language bestLanguage = myCategorizer.predictLanguage(baseline);
-        if(!bestLanguage.getLang().equals(langCode)) {
-            System.err.println("**** WARNING: baseline string in "+langCode+" language not properly detected as "+langCode);
-        }
-        double baselineConfidence = bestLanguage.getConfidence();
-        System.err.println("Baseline confidence: " + baselineConfidence);
-        System.err.println("----------------------------------------");
-
-        String[] sentences = sentenceDetector.sentDetect(text);
-
-        for(int i = 0; i < sentences.length; i++) {
-            String unknownLangSentenceAppendedToBaseline = baseline+" "+sentences[i];
-
-            System.err.println("Added sentence: " + sentences[i]);
-
-            bestLanguage = myCategorizer.predictLanguage(unknownLangSentenceAppendedToBaseline);
-            double confidence = bestLanguage.getConfidence();
-            //System.err.println("Confidence is now " + confidence);
-
-            // confidence in text's detected language should increase
-            // with additional sentence in same language (or should stay about the same?)
-            // not decrease with additional sentence in same language
-            if(bestLanguage.getLang().equals(langCode) && confidence > baselineConfidence) {
-
-                System.err.println("Added sentence increased confidence to: " + confidence);
-                mriSentences.add(sentences[i]);
-            }
-            else {
-                System.err.println("ADDED sentence not in " + langCode + " as it DECREASED confidence to: " + confidence);
-            }
-            System.err.println("");
-
-        }
-        return mriSentences;
+        // For evaluating single languages, I used a very small data set and found that
+        // if the primary language detected is MRI AND if the confidence is >= 0.1, the
+        // results appear reasonably to be in te reo Māori.
+
+        final double confidenceCutoff = 0.1;
+        return getAllSentencesInLanguage(MAORI_3LETTER_CODE, text, confidenceCutoff);
     }
 
@@ -588,8 +500,5 @@
 
         // TODO
-        maoriTextDetector.old_getAllSentencesInMaori(MRI_SENTENCE_TEST);
-        //maoriTextDetector.getAllSentencesInLanguage(MAORI_3LETTER_CODE, MRI_SENTENCE_TEST);
-        //maoriTextDetector.getAllSentencesInLanguage(MAORI_3LETTER_CODE, TWO_HIGH_CONFIDENCE_MRI_SENTENCES,
-        maoriTextDetector.old_getAllSentencesInMaori(
+        maoriTextDetector.getAllSentencesInMaori(
             "Primary sources ~ Published Maramataka Mo Te Tau 1885, Nepia: Te Haaringi, Kai-ta Pukapuka, kei Hehitingi Tiriti, 1884. Maramataka Mo Te Tau 1886, Nepia: Na te Haaringi i ta ki tona Whare Perehi Pukapuka, 1885. Maramataka Mo Te Tau 1887, Nepia: Na te Haaringi i ta ki tona Whare Perehi Pukapuka, 1886. Maramataka Mo Te Tau 1888, Nepia: Na te Haaringi i ta ki tona Whare Perehi Pukapuka, 1887. Maramataka Mo Te Tau 1889, Nepia: Na te Haaringi i ta ki tona Whare Perehi Pukapuka, 1888. Maramataka Mo Te Tau 1890, Nepia: Na te Haaringi i ta ki tona Whare Perehi Pukapuka, 1889. Maramataka Mo Te Tau 1891, Kihipane: Na te Muri i ta ki tona Whare Perehi Pukapuka, 1890. Maramataka Mo Te Tau 1892, Nepia: Na te Haaringi, i ta ki tona Whare Perehi Pukapuka, 1891. Maramataka Mo Te Tau 1893, Kihipane: Na te Muri i ta ki tona Whare Perehi Pukapuka, 1892. Maramataka Mo Te Tau 1894, Kihipane: Na te Muri i ta ki tona Whare Perehi Pukapuka, 1893. Maramataka Me Te Tau 1895, Kihipane: Na te Muri i Ta ki tona whare perehi pukapuka, 1894. Maramataka Mo Te Tau 1896, Kihipane: Na te Muri i ta ki tona Whare Perehi Pukapuka, 1895. Maramataka Mo Te Tau 1897, Kihipane: Na te Muri i ta ki tona Whare Perehi Pukapuka 1896. Maramataka Mo Te Tau 1898, Turanga: Na te Wiremu Hapata i ta ki Te Rau Kahikatea, 1897. Ko Te Paipera Tapu Ara, Ko Te Kawenata Tawhito Me Te Kawenata Hou, He Mea Whakamaori Mai No Nga Reo I Oroko-Tuhituhia Ai, Ranana: He mea ta ki te perehi a W.M.Watts ma te Komiti Ta Paipera mo Ingarangi mo Te Ao Katoa, 1868. Ko Te Pukapuka O Nga Inoinga, Me Era Atu Tikanga, I Whakaritea E Te Hahi O Ingarani, Mo Te Minitatanga O Nga Hakarameta, O Era Atu Ritenga a Te Hahi: Me Nga Waiata Ano Hoki a Rawiri, Me Te Tikanga Mo Te Whiriwhiringa, Mo Te Whakaturanga, Me Te Whakatapunga O Nga Pihopa, O Nga Piriti, Me Nga Rikona, Me Nga Himene, Ranana: I taia tenei ki te perehi o te Komiti mo te whakapuaki i to mohiotanga ki a te Karaiti, 1858. Ko Te Pukapuka O Nga Inoinga, Me Era Atu Tikanga, I Whakaritea E Te Hahi O Ingarani, Mo Te Minitatanga O Nga Hakarameta, O Era Atu Ritenga a Te Hahi: Me Nga Waiata Ano Hoki a Rawiri, Me Te Tikanga Mo Te Whiriwhiringa, Mo Te Whakaturanga, Me Te Whakatapunga O Nga Pihopa, O Nga Piriti, Me Nga Rikona. 1883. The Book of Common Prayer, and Administration of the Sacraments, and Other Rites and Ceremonies of the Church, According to the Use of the United Church of England and Ireland: Together with the Proper Lessons for Sundays and Other Holy-Days, and a New Version of the Psalms of David, Oxford: Printed at 134 the University Press, 1852. The Book of Common Prayer and Administration of the Sacraments, and Other Rites and Ceremonies of the Church, According to the Church of England: Together with the Psalter or Psalms of David, Printed as They Are to Be Sung or Said in Churches: And the Form and Manner of Making, Ordaining, and Consecrating of Bishops, Priests, and Deacons, London: G.E. Eyre and W. Spottiswoode, after 1871 but before 1877. Brown, A.N., The Journals of A.N. Brown C.M.S. Missionary Tauranga Covering the Years 1840 to 1842, Tauranga: The Elms Trust, 1990 (Commemorative Edition). ______________, Select Sermons of A.N. Brown, Tauranga: The Elms Trust, 1997. Fitzgerald, Caroline (ed.), Te Wiremu Henry Williams: Early Years in the North, Wellington: Huia Publishers, 2011. The Hawke's Bay Almanac, Napier: James Wood, Hawke's Bay Herald, 1862, 1863, 1867.");
 
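
Illustration: the revised getAllSentencesInLanguage keeps a sentence only when the top predicted language equals langCode and its confidence reaches confidenceCutoff. The sketch below isolates that filter so it can be run without the OpenNLP models. It is not the committed code: Prediction, toyPredictor and the class name SentenceLanguageFilter are hypothetical stand-ins for the Language object and the myCategorizer.predictLanguage call seen in the diff, and the toy predictor's rule (look for the word "te") is only a placeholder, not a real language detector.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Sketch of per-sentence language filtering with a confidence cutoff. */
class SentenceLanguageFilter {

    /** Hypothetical stand-in for a detector result: language code plus confidence. */
    static final class Prediction {
        final String lang;
        final double confidence;

        Prediction(String lang, double confidence) {
            this.lang = lang;
            this.confidence = confidence;
        }
    }

    /**
     * Keeps the sentences whose best predicted language is langCode with a
     * confidence of at least confidenceCutoff, preserving input order.
     */
    static List<String> sentencesInLanguage(String langCode, String[] sentences,
                                            double confidenceCutoff,
                                            Function<String, Prediction> predictor) {
        List<String> kept = new ArrayList<>();
        for (String sentence : sentences) {
            Prediction best = predictor.apply(sentence);
            if (best.lang.equals(langCode) && best.confidence >= confidenceCutoff) {
                kept.add(sentence);
            }
        }
        return kept;
    }

    /** Toy predictor for demonstration only: "mri" if the word "te" occurs, else "eng". */
    static Prediction toyPredictor(String sentence) {
        boolean hasTe = (" " + sentence.toLowerCase() + " ").contains(" te ");
        return hasTe ? new Prediction("mri", 0.9) : new Prediction("eng", 0.8);
    }

    public static void main(String[] args) {
        String[] sentences = {
            "Hohoro tonu te whakaae me te haere katoa o nga hoia.",
            "Pinky was here today!",
            "Na te kaha o te ra ka taimaha o ratou ringaringa."
        };
        // 0.1 mirrors the cutoff that getAllSentencesInMaori passes in r33585.
        List<String> kept = sentencesInLanguage("mri", sentences, 0.1,
                                                SentenceLanguageFilter::toyPredictor);
        for (String s : kept) {
            System.out.println(s);
        }
    }
}
```

Passing the predictor as a Function keeps the threshold logic testable on its own; in the real class the sentence array would come from sentenceDetector.sentDetect(text) and each prediction from the loaded language detection model.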