Changeset 33584 for gs3-extensions/maori-lang-detection
- Timestamp: 2019-10-18T21:20:39+13:00
- Files: 1 edited
gs3-extensions/maori-lang-detection/src/org/greenstone/atea/MaoriTextDetector.java
r33583 r33584
 73  73   * to be in MRI.
 74  74   */
 75     - private final String TWO_HIGH_CONFIDENCE_MRI_SENTENCES = "Hohoro tonu te whakaae me te haere katoa o nga hoia, kit e wahi e whakamatea nei e te Tarakona. E hou ana to ratou taenga atu, mataara tonu ratou, a kihai I mahue o ratou kahu arai; ka moe etahi ka ara etahi kit e whanga; ano to hunga e tu ana, rawe rawa I te kanapa o a ratou kahu arai, me o ratou ringaringa; hari tonu te tangata kainga no te mea kei waenga pu I a ratou nga hoia o te Kingi e whanga ana."; // 0.9497468988295584
     75 + private final static String TWO_HIGH_CONFIDENCE_MRI_SENTENCES = "Hohoro tonu te whakaae me te haere katoa o nga hoia, kit e wahi e whakamatea nei e te Tarakona. E hou ana to ratou taenga atu, mataara tonu ratou, a kihai I mahue o ratou kahu arai; ka moe etahi ka ara etahi kit e whanga; ano to hunga e tu ana, rawe rawa I te kanapa o a ratou kahu arai, me o ratou ringaringa; hari tonu te tangata kainga no te mea kei waenga pu I a ratou nga hoia o te Kingi e whanga ana."; // 0.9497468988295584
 76  76
 77  77   //"Nan, kia mataara koutou. Ki te mahara koutou ki aku kupu, a ka karanga mai ki taku ingoa, I nga wa, e tata mai ai te mate; a kit e mau tonu ano hoki koutou ki te kahu aria katoa, me nga ringaringa, kua oti nei te taka e ahau mo koutou, e kore koutou e mate I te Tarakona."; // 0.7220962333610585
  …   …
 79  79   //"E rapu ana ia i te hoari a tona Piriniha, a te kitea ki tana taha e mau ana, ka ngore noa nga turi, e haerea atu ana e te Tarakona, e karanga ana ia ki tona Kingi, otira e mea ake ana a roto i a ia, kua pahure ke te ra e karanga atu ai ia; kua whakarere hoki ia i ana kahu arai me ona ringaringa, a kahore he mea hei whakakora i a ia, tahuri noa atu ia ki te oma, hoake rawa kua kapi mai a mua ona, i nga tao o te Tarakona, a na te mea kua mahue i a ia nga takai paraihi, ngore noa ona waewae, kainga ana ia e te Tarakona. I peneitia ano hoki etahi, a ngaro noa iho ratou i te tirohanga a o ratou [6/7] hoa; ka puta te mahara ki o ratou hoa, ka pouri o ratou ngakau, otira kihai roa, kua hakari ano ratou, kua inu, kua hari, kua whakarere i o ratou kahu arai, kua wareware ano hoki ki nga kupu a to ratou Piriniha, te mahara kua patata te mate."; //0.991402350887951.
 80  80
 81     - /** http://www.greenstone.org/ */
 82     - private final String TWO_HIGH_CONFIDENCE_EN_SENTENCES="We hope that this software will encourage the effective deployment of digital libraries to share information and place it in the public domain. Further information can be found in the book How to build a digital library, authored by three of the group's members.";
     81 + /** English language sentences from http://www.greenstone.org/ */
     82 + private final static String TWO_HIGH_CONFIDENCE_EN_SENTENCES="We hope that this software will encourage the effective deployment of digital libraries to share information and place it in the public domain. Further information can be found in the book How to build a digital library, authored by three of the group's members.";
 83  83
 84  84
  …   …
173 173   }
174 174
175     - public ArrayList<String> getAllSentencesInMaori(String text) throws Exception {
    175 + public ArrayList<String> old_getAllSentencesInMaori(String text) throws Exception {
176 176   // big assumption here: that we can split incoming text into sentences
177 177   // for any language (using the Māori language trained sentence model),
  …   …
191 191   double prev_confidence = 0.0;
192 192
193     - for(int i = 1; i < sentences.length; i++) {
194     - String two_sentences = sentences[i-1]+" "+sentences[i]+" This is another sentence.";
195     - //for(int i = 0; i < sentences.length; i++) {
196     - //String two_sentences = sentences[i];
197     -
    193 + //for(int i = 1; i < sentences.length; i++) {
    194 + //String two_sentences = sentences[i-1]+" "+sentences[i]+" This is another sentence.";
    195 + for(int i = 0; i < sentences.length; i++) {
    196 + String two_sentences = sentences[i];
198 197
199 198   System.err.println(two_sentences);
  …   …
201 200   //isTextInMaori(two_sentences)
202 201   Language bestLanguage = myCategorizer.predictLanguage(two_sentences);
203     - if(bestLanguage.getLang().equals(MAORI_3LETTER_CODE)) {
204     - double confidence = bestLanguage.getConfidence();
    202 + double confidence = bestLanguage.getConfidence();
    203 + if(bestLanguage.getLang().equals(MAORI_3LETTER_CODE)
    204 + && confidence >= 0.1) {
205 205
    206 + /*
206 207   if(prev_confidence >= this.MINIMUM_CONFIDENCE) {
  …   …
218 219   System.err.println("Confidence for sentences up to " + i + ": " + confidence);
219 220   System.err.println("");
    221 + } else {
    222 + System.err.println("NOT primary language - confidence: " + confidence);
220 223   }
221 224
  …   …
231 234   }
232 235
233     - public ArrayList<String> getAllSentencesInLanguage(String langCode, String text) throws Exception {
    236 +
    237 + public ArrayList<String> getAllSentencesInMaori(String text) throws Exception {
234 238   // big assumption here: that we can split incoming text into sentences
235 239   // for any language (using the Māori language trained sentence model),
  …   …
239 243
240 244   // we'll be storing just those sentences in text that are in Māori.
241     - ArrayList<String> mriSentences = new ArrayList<String>();
    245 +
242 246   // OpenNLP language detection works best with a minimum of 2 sentences
243 247   // See https://opennlp.apache.org/news/model-langdetect-183.html
  …   …
251 255
252 256   String baseline = TWO_HIGH_CONFIDENCE_MRI_SENTENCES;
253     -
    257 + return getAllSentencesInLanguage(MAORI_3LETTER_CODE, baseline, text);
    258 + }
    259 +
    260 +
    261 + public ArrayList<String> getAllSentencesInLanguage(String langCode, String baseline, String text) throws Exception {
    262 + // we'll be storing just those sentences that are in the requested language code
    263 + ArrayList<String> mriSentences = new ArrayList<String>();
    264 +
254 265   Language bestLanguage = myCategorizer.predictLanguage(baseline);
255 266   if(!bestLanguage.getLang().equals(langCode)) {
256     - System.err.println(" @@@@ Something's gone wrong, obvious "+MAORI_3LETTER_CODE+" language string not properly detected as "+MAORI_3LETTER_CODE+" any more.");
    267 + System.err.println("**** WARNING: baseline string in "+langCode+" language not properly detected as "+langCode);
257 268   }
258 269   double baselineConfidence = bestLanguage.getConfidence();
  …   …
271 282   //System.err.println("Confidence is now " + confidence);
272 283
273     - //if(!bestLanguage.getLang().equals(langCode) || confidence < this.MINIMUM_CONFIDENCE) {
274     -
275     - // confidence should increase with added sentence in same language (or should
276     - // stay about the same?) not decrease with added sentence in same language
    284 + // confidence in text's detected language should increase
    285 + // with additional sentence in same language (or should stay about the same?)
    286 + // not decrease with additional sentence in same language
277 287   if(bestLanguage.getLang().equals(langCode) && confidence > baselineConfidence) {
278 288
279     - System.err.println("Added sentence (maintained or) increased confidence to: " + confidence);
280     -
281     -
    289 + System.err.println("Added sentence increased confidence to: " + confidence);
    290 + mriSentences.add(sentences[i]);
282 291   }
283 292   else {
  …   …
579 588
580 589   // TODO
581     - maoriTextDetector.getAllSentencesInMaori(MRI_SENTENCE_TEST);
    590 + maoriTextDetector.old_getAllSentencesInMaori(MRI_SENTENCE_TEST);
582 591   //maoriTextDetector.getAllSentencesInLanguage(MAORI_3LETTER_CODE, MRI_SENTENCE_TEST);
583     - maoriTextDetector.getAllSentencesInLanguage(MAORI_3LETTER_CODE,
    592 + //maoriTextDetector.getAllSentencesInLanguage(MAORI_3LETTER_CODE, TWO_HIGH_CONFIDENCE_MRI_SENTENCES,
    593 + maoriTextDetector.old_getAllSentencesInMaori(
584 594   "Primary sources ~ Published Maramataka Mo Te Tau 1885, Nepia: Te Haaringi, Kai-ta Pukapuka, kei Hehitingi Tiriti, 1884. Maramataka Mo Te Tau 1886, Nepia: Na te Haaringi i ta ki tona Whare Perehi Pukapuka, 1885. Maramataka Mo Te Tau 1887, Nepia: Na te Haaringi i ta ki tona Whare Perehi Pukapuka, 1886. Maramataka Mo Te Tau 1888, Nepia: Na te Haaringi i ta ki tona Whare Perehi Pukapuka, 1887. Maramataka Mo Te Tau 1889, Nepia: Na te Haaringi i ta ki tona Whare Perehi Pukapuka, 1888. Maramataka Mo Te Tau 1890, Nepia: Na te Haaringi i ta ki tona Whare Perehi Pukapuka, 1889. Maramataka Mo Te Tau 1891, Kihipane: Na te Muri i ta ki tona Whare Perehi Pukapuka, 1890. Maramataka Mo Te Tau 1892, Nepia: Na te Haaringi, i ta ki tona Whare Perehi Pukapuka, 1891. Maramataka Mo Te Tau 1893, Kihipane: Na te Muri i ta ki tona Whare Perehi Pukapuka, 1892. Maramataka Mo Te Tau 1894, Kihipane: Na te Muri i ta ki tona Whare Perehi Pukapuka, 1893. Maramataka Me Te Tau 1895, Kihipane: Na te Muri i Ta ki tona whare perehi pukapuka, 1894.
Maramataka Mo Te Tau 1896, Kihipane: Na te Muri i ta ki tona Whare Perehi Pukapuka, 1895. Maramataka Mo Te Tau 1897, Kihipane: Na te Muri i ta ki tona Whare Perehi Pukapuka 1896. Maramataka Mo Te Tau 1898, Turanga: Na te Wiremu Hapata i ta ki Te Rau Kahikatea, 1897. Ko Te Paipera Tapu Ara, Ko Te Kawenata Tawhito Me Te Kawenata Hou, He Mea Whakamaori Mai No Nga Reo I Oroko-Tuhituhia Ai, Ranana: He mea ta ki te perehi a W.M.Watts ma te Komiti Ta Paipera mo Ingarangi mo Te Ao Katoa, 1868. Ko Te Pukapuka O Nga Inoinga, Me Era Atu Tikanga, I Whakaritea E Te Hahi O Ingarani, Mo Te Minitatanga O Nga Hakarameta, O Era Atu Ritenga a Te Hahi: Me Nga Waiata Ano Hoki a Rawiri, Me Te Tikanga Mo Te Whiriwhiringa, Mo Te Whakaturanga, Me Te Whakatapunga O Nga Pihopa, O Nga Piriti, Me Nga Rikona, Me Nga Himene, Ranana: I taia tenei ki te perehi o te Komiti mo te whakapuaki i to mohiotanga ki a te Karaiti, 1858. Ko Te Pukapuka O Nga Inoinga, Me Era Atu Tikanga, I Whakaritea E Te Hahi O Ingarani, Mo Te Minitatanga O Nga Hakarameta, O Era Atu Ritenga a Te Hahi: Me Nga Waiata Ano Hoki a Rawiri, Me Te Tikanga Mo Te Whiriwhiringa, Mo Te Whakaturanga, Me Te Whakatapunga O Nga Pihopa, O Nga Piriti, Me Nga Rikona. 1883. The Book of Common Prayer, and Administration of the Sacraments, and Other Rites and Ceremonies of the Church, According to the Use of the United Church of England and Ireland: Together with the Proper Lessons for Sundays and Other Holy-Days, and a New Version of the Psalms of David, Oxford: Printed at 134 the University Press, 1852. The Book of Common Prayer and Administration of the Sacraments, and Other Rites and Ceremonies of the Church, According to the Church of England: Together with the Psalter or Psalms of David, Printed as They Are to Be Sung or Said in Churches: And the Form and Manner of Making, Ordaining, and Consecrating of Bishops, Priests, and Deacons, London: G.E. Eyre and W. Spottiswoode, after 1871 but before 1877. Brown, A.N., The Journals of A.N. 
Brown C.M.S. Missionary Tauranga Covering the Years 1840 to 1842, Tauranga: The Elms Trust, 1990 (Commemorative Edition). ______________, Select Sermons of A.N. Brown, Tauranga: The Elms Trust, 1997. Fitzgerald, Caroline (ed.), Te Wiremu Henry Williams: Early Years in the North, Wellington: Huia Publishers, 2011. The Hawke's Bay Almanac, Napier: James Wood, Hawke's Bay Herald, 1862, 1863, 1867."); 585 595
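The baseline comparison that this changeset moves into getAllSentencesInLanguage(langCode, baseline, text) can be sketched in isolation. The following is a minimal illustration under stated assumptions, not the code from MaoriTextDetector.java: the Prediction class, the Detector interface, and the toy particle-counting detector are hypothetical stand-ins for OpenNLP's myCategorizer.predictLanguage(), and the step of appending each candidate sentence to the baseline is inferred from the changeset's comments, since those lines are not visible in the diff.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class BaselineFilterSketch {

    /** Hypothetical stand-in for a detector's best guess: language code plus confidence. */
    static final class Prediction {
        final String lang;
        final double confidence;

        Prediction(String lang, double confidence) {
            this.lang = lang;
            this.confidence = confidence;
        }
    }

    /** Hypothetical detector interface; the changeset calls myCategorizer.predictLanguage() instead. */
    interface Detector {
        Prediction predict(String text);
    }

    /** Toy word list, for illustration only: a few common Maori particles. */
    static final Set<String> PARTICLES =
            new HashSet<>(Arrays.asList("te", "o", "nga", "ki", "ka", "e", "i", "me", "ana"));

    /** Toy detector (an assumption, not OpenNLP): scores text by its share of known particles. */
    static Prediction toyPredict(String text) {
        String[] words = text.toLowerCase().trim().split("\\s+");
        int hits = 0;
        for (String word : words) {
            if (PARTICLES.contains(word)) {
                hits++;
            }
        }
        double share = (double) hits / words.length;
        return share >= 0.5 ? new Prediction("mri", share) : new Prediction("eng", 1.0 - share);
    }

    /**
     * Keeps a sentence only if appending it to the baseline leaves the detected
     * language equal to langCode and raises the confidence above the baseline's own.
     */
    static List<String> sentencesInLanguage(Detector detector, String langCode,
                                            String baseline, List<String> sentences) {
        double baselineConfidence = detector.predict(baseline).confidence;
        List<String> kept = new ArrayList<>();
        for (String sentence : sentences) {
            Prediction p = detector.predict(baseline + " " + sentence);
            if (p.lang.equals(langCode) && p.confidence > baselineConfidence) {
                kept.add(sentence);
            }
        }
        return kept;
    }

    /** Runs the filter over three made-up sentences with the toy detector. */
    static List<String> demo() {
        List<String> sentences = Arrays.asList(
                "e haere ana te hoia", "this is another sentence", "ki te whare");
        return sentencesInLanguage(BaselineFilterSketch::toyPredict, "mri",
                "ka moe te tangata", sentences);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Passing the baseline as a parameter, as the changeset does, lets the same filter serve any language code. The strict greater-than comparison mirrors the condition in the diff, so a sentence that merely maintains the baseline confidence is dropped.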