Changeset 33585

Timestamp:
18.10.2019 21:41:32
Author:
ak19
Message:

A much simpler way of using the sentence detection and language detection models: work on a single sentence at a time. Not sure if it is truly the best way, but it is at least as good as, or better than, my older attempts. Committing with debugging output still enabled.
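
The single-sentence approach described in the message can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the committed code: `SentenceLanguageFilter`, `Detector` and `Prediction` are hypothetical stand-ins for the OpenNLP sentence detector and language categorizer that MaoriTextDetector.java uses, chosen so the sketch runs without any model files.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: keep only sentences detected as a given language. */
public class SentenceLanguageFilter {

    /** Hypothetical stand-in for a detector's best guess: language code plus confidence. */
    public static final class Prediction {
        public final String lang;
        public final double confidence;

        public Prediction(String lang, double confidence) {
            this.lang = lang;
            this.confidence = confidence;
        }
    }

    /** Hypothetical stand-in for the language categorizer used in the diff. */
    public interface Detector {
        Prediction predict(String sentence);
    }

    /**
     * Evaluates each sentence on its own and keeps it only if the best
     * predicted language equals langCode and the confidence reaches the cutoff.
     * The sentences are assumed to have been split already.
     */
    public static List<String> sentencesInLanguage(String[] sentences, Detector detector,
                                                   String langCode, double confidenceCutoff) {
        List<String> kept = new ArrayList<>();
        for (String sentence : sentences) {
            Prediction best = detector.predict(sentence);
            if (best.lang.equals(langCode) && best.confidence >= confidenceCutoff) {
                kept.add(sentence);
            }
        }
        return kept;
    }
}
```

With a real model, the `Detector` would presumably wrap a call such as `myCategorizer.predictLanguage(sentence)` from the diff, and the caller would pass `MAORI_3LETTER_CODE` with the 0.1 cutoff that the diff's comments describe as giving reasonable results on a very small data set.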

Files:
1 modified

  • gs3-extensions/maori-lang-detection/src/org/greenstone/atea/MaoriTextDetector.java

    r33584 r33585  
    6666    private final String LANG_DETECT_MODEL_RELATIVE_PATH = OPENNLP_MODELS_RELATIVE_PATH + "langdetect-183.bin"; 
    6767 
    68     /** Two Māori language sentences taken from http://anglicanhistory.org/england/swilberforce/agathos1882.html 
    69      * which have a reasonable/high confidence in detection. 
    70      * We'll use this String of 2 high confidence MRI sentences to detect whether the addition 
    71      * of a subsequent sentence of unknown language brings down the cumulative confidence level 
    72      * drastically (below DEF MIN CONF), implying that the added sentence is therefore not likely 
    73      * to be in MRI. 
    74      */ 
    75     private final static String TWO_HIGH_CONFIDENCE_MRI_SENTENCES = "Hohoro tonu te whakaae me te haere katoa o nga hoia, kit e wahi e whakamatea nei e te Tarakona. E hou ana to ratou taenga atu, mataara tonu ratou, a kihai I mahue o ratou kahu arai; ka moe etahi ka ara etahi kit e whanga; ano to hunga e tu ana, rawe rawa I te kanapa o a ratou kahu arai, me o ratou ringaringa; hari tonu te tangata kainga no te mea kei waenga pu I a ratou nga hoia o te Kingi e whanga ana."; // 0.9497468988295584 
    76      
    77     //"Nan, kia mataara koutou. Ki te mahara koutou ki aku kupu, a ka karanga mai ki taku ingoa, I nga wa, e tata mai ai te mate; a kit e mau tonu ano hoki koutou ki te kahu aria katoa, me nga ringaringa, kua oti nei te taka e ahau mo koutou, e kore koutou e mate I te Tarakona."; // 0.7220962333610585 
    78      
    79     //"E rapu ana ia i te hoari a tona Piriniha, a te kitea ki tana taha e mau ana, ka ngore noa nga turi, e haerea atu ana e te Tarakona, e karanga ana ia ki tona Kingi, otira e mea ake ana a roto i a ia, kua pahure ke te ra e karanga atu ai ia; kua whakarere hoki ia i ana kahu arai me ona ringaringa, a kahore he mea hei whakakora i a ia, tahuri noa atu ia ki te oma, hoake rawa kua kapi mai a mua ona, i nga tao o te Tarakona, a na te mea kua mahue i a ia nga takai paraihi, ngore noa ona waewae, kainga ana ia e te Tarakona. I peneitia ano hoki etahi, a ngaro noa iho ratou i te tirohanga a o ratou [6/7] hoa; ka puta te mahara ki o ratou hoa, ka pouri o ratou ngakau, otira kihai roa, kua hakari ano ratou, kua inu, kua hari, kua whakarere i o ratou kahu arai, kua wareware ano hoki ki nga kupu a to ratou Piriniha, te mahara kua patata te mate."; //0.991402350887951. 
    80  
    81     /** English language sentences from http://www.greenstone.org/ */ 
    82     private final static String TWO_HIGH_CONFIDENCE_EN_SENTENCES="We hope that this software will encourage the effective deployment of digital libraries to share information and place it in the public domain. Further information can be found in the book How to build a digital library, authored by three of the group's members."; 
    83  
    84      
    85     /**  
    86      * Large chunk of text in te reo Māori from 
    87      * http://anglicanhistory.org/england/swilberforce/agathos1882.html 
    88      * for testing the language detector 
    89      */ 
    90     private final static String MRI_SENTENCE_TEST="Meake ratou haere, ka mea atu ia ki a ratou, \"Nana, e matau ana koutou ahakoa whakaputaina mai te riri me te kaha katoa o te Tarakona ki ahau I mua ra, kihai ia I kaha, a mate ana I ahau. Me aru katoa aku hoa pono, I te ritenga kua waiho iho e ahau ki a ratou kia maia ratou, me ahau kua maia; ko reira ratou noho ai I raro-raro iho I toku torona. Pinky was here today! Na koneil I tonoa atu ai koutou e ahau, kit e whawhai ki tenei Tarakona, a k otaku kaha e haere tahi atu me koutou ki te taua. Nan, kia mataara koutou. Ki te mahara koutou ki aku kupu, a ka karanga mai ki taku ingoa, I nga wa, e tata mai ai te mate; a kit e mau tonu ano hoki koutou ki te kahu aria katoa, me nga ringaringa, kua oti nei te taka e ahau mo koutou, e kore koutou e mate I te Tarakona. Otira kit e rokohina whakaarokoretia mai koutou e ia, a kahore o koutou kahu aria e mau ana, ka mate koutou I a ia.\" [4] Hohoro tonu te whakaae me te haere katoa o nga hoia, kit e wahi e whakamatea nei e te Tarakona. E hou ana to ratou taenga atu, mataara tonu ratou, a kihai I mahue o ratou kahu arai; ka moe etahi ka ara etahi kit e whanga; ano to hunga e tu ana, rawe rawa I te kanapa o a ratou kahu arai, me o ratou ringaringa; hari tonu te tangata kainga no te mea kei waenga pu I a ratou nga hoia o te Kingi e whanga ana. Ma te ata e titiro ko nga kai mataara o te po ka kaere kit e moe, a ko te hunga kua moe, oti rawa te whakakahu ki o ratou, kahu arai, me to whakam[a]tautau hoki i te koinga o nga hoari, ka karanga ratou ki te ingoa o to ratou Piriniha, a ka haere ki te whanga i te Tarakona kino. Rawe rawa ratou i konei, otira kihai ratou i mau tonu ki tenei ritenga; tiaki noa hoki ratou, a te puta te Tarakona. Marire tonu to ratou kainga. 
Ngaki ana te tangata whenua i a ratou mara, a ka haere ano hoki ka tata ke hauhakenga, e marena ana ratou, e tuku hakari ana, e hokohoko ana; a ka whakaaro nga hoia he teka noa pea nga rongo o te Tarakona, ka wareware haere ki te kupu o to ratou Piriniha mo te mataara, me te tupato. Na te kaha o te ra ka taimaha o ratou ringaringa; mea noa tetahi \"Ha, he aha te tikanga i maua tonutia ai tenei potae taimaha? Wera noa iho taku matenga i te whitinga iho o te ra ki tenei potae, a te kitea te Tarakona e meingatia nei, ka mahue [4/5] rawa i ahahu te potae nei ki te teneti, hei te kiteatanga at u o te Tarakona e haere mai ana, ka tiki ai. Pera noa hoki tetahik ki te arai o tona uma, me tetahi hoki ki tona arai. A na te wera o te whenua ka wera ake nga takai paraihi o o ratou waewae; mamae noa ratou, a mahue iho era, a ka marara ratou, puta noa ki tenei hakari ki tera marengatanga ranei. Kihai i matauria he hoia ratou no te Kingi, ma te rapu tonu ano ia ki tana tohu e mau ana, ka matauria ai, mahue rawa hoki te ahua i tonoa mai ai ratou e to ratou Piriniha ke te taua. Kotahi ia o ratou kihai rite ki ana hoa, ko Akatohe te ingoa, pouri raw tona ngakau ki a ratou mahi. He tini ana whakamaharatanga at ki a ratou i nga kupu a to ratou Piriniha, mea atu ana ia ki a ratou, \"Ahakoa te kitea, e koro ma, te hoa riri, tenei ano ia te patata ana; a kahore he pohehetanga o to tatou Piriniha, kua whawhai hoki ia ki te Tarakona, a kua matau ia ki tana ahua whakamataku.\" Kataina ana, tawaia ana tenei tangata maia, meinga ana ia he wawau, no te rite ana mahi ki a ratou. Otiia kiahi ia i whakarongo; a ahakoa puta o ratou kupo kino, ahakoa kah te ra o te awatea hei whakahemo i a ia, ahahkoa negenge ia i tona haerenga i te weranga o te onepu, ahakoa kuiki ia i nga huarahi o te po, kihai i mahue i a Akatohe nga kahu arai a tona Piriniha, i hoatu ai kia mau tonu i a ia; kihai hoki i mahue i a ia nga takai paraihi o ana waewae mamae, me tana mahi mataara i te po. 
[5/6] Roa rawa iho to ratou noho penei, a te kitea mai te hoa riri, ka kake haee o ratou kupu kino ki a ia. A mea kau ano ratou, \"he ora, kahore, kua patata te mate.\" Katahi hoki ka kitea nga tohu whakamatau, me he ai tangta hei titiro. Tera taua hoia i tenei wa e hoki mai ana i te hakari, kua hari, kua waiata, kua kanikani ratou, a kua mahue i taua hoia ana kahu arai me ana ringaringa; a tenei ia te hoki marire ana ki tana teneti i te ahi-ahi o te rangi raumati. E whakaaro haere ana ia ki ana hoa i taua hakari, ki te rawe ano hoki ona, e mhh an ki a Akatohe mona e wehi nei, e haereere tonu nei i te roro o tona teneti e pehia ana e te taimaha o ana kahu arai. E whakaaroa ana ano enei mea, ka rongo ia ki te ngaehe e puta mai ana i te motu ngahere, ki matau ona, a me te uira ano te puta whakarere mai o te Tarakona ki mua ona. E rapu ana ia i te hoari a tona Piriniha, a te kitea ki tana taha e mau ana, ka ngore noa nga turi, e haerea atu ana e te Tarakona, e karanga ana ia ki tona Kingi, otira e mea ake ana a roto i a ia, kua pahure ke te ra e karanga atu ai ia; kua whakarere hoki ia i ana kahu arai me ona ringaringa, a kahore he mea hei whakakora i a ia, tahuri noa atu ia ki te oma, hoake rawa kua kapi mai a mua ona, i nga tao o te Tarakona, a na te mea kua mahue i a ia nga takai paraihi, ngore noa ona waewae, kainga ana ia e te Tarakona. I peneitia ano hoki etahi, a ngaro noa iho ratou i te tirohanga a o ratou [6/7] hoa; ka puta te mahara ki o ratou hoa, ka pouri o ratou ngakau, otira kihai roa, kua hakari ano ratou, kua inu, kua hari, kua whakarere i o ratou kahu arai, kua wareware ano hoki ki nga kupu a to ratou Piriniha, te mahara kua patata te mate. Tera te Tarakona kua manamanangia i te matenga o etahi o nga hoia i a ia, ka whakaaroaro kia huakina putia nga toenga iho, kia kotahi ai matenga o ana hoa riri."; 
    91      
    9268    /** 
    9369     * The LanguageDetectorModel object that will do the actual language detection/prediction for us. 
     
    173149    } 
    174150 
    175     public ArrayList<String> old_getAllSentencesInMaori(String text) throws Exception { 
    176     // big assumption here: that we can split incoming text into sentences 
    177     // for any language (using the Māori language trained sentence model), 
    178     // despite not knowing what language those sentences are in 
    179     // Hinges on MRI sentences detection being similar to at least ENG equivalent 
    180  
    181  
    182     // we'll be storing just those sentences in text that are in Māori.  
     151    /**  
     152     * In this class' constructor, need to have set up the Sentence Detection Model 
     153     * for the langCode passed in to this function in order for the output to make 
     154     * sense for that language. 
     155     */ 
     156    public ArrayList<String> getAllSentencesInLanguage(String langCode, String text, double confidenceCutoff) 
     157    { 
     158 
     159    // we'll be storing just those sentences in text that are in the denoted language code 
    183160    ArrayList<String> mriSentences = new ArrayList<String>(); 
    184161    // OpenNLP language detection works best with a minimum of 2 sentences 
     
    186163    // "It is important to note that this model is trained for and works well with 
    187164    // longer texts that have at least 2 sentences or more from the same language." 
    188     // So we'll be attempting to detect the language working on 2 sentences at a time 
     165     
     166    // For evaluating single languages, I used a very small data set and found that 
     167    // if the primary language detected is MRI AND if the confidence is >= 0.1, the 
     168    // results appear reasonably to be in te reo Māori. 
    189169     
    190170    String[] sentences = sentenceDetector.sentDetect(text); 
    191     double prev_confidence = 0.0; 
    192      
    193     //for(int i = 1; i < sentences.length; i++) { 
    194     //String two_sentences = sentences[i-1]+" "+sentences[i]+" This is another sentence."; 
     171     
    195172    for(int i = 0; i < sentences.length; i++) { 
    196         String two_sentences = sentences[i]; 
    197      
    198         System.err.println(two_sentences);       
    199  
    200         //isTextInMaori(two_sentences) 
    201         Language bestLanguage = myCategorizer.predictLanguage(two_sentences); 
     173        String sentence = sentences[i];      
     174         
     175        //System.err.println(sentence); 
     176 
     177        Language bestLanguage = myCategorizer.predictLanguage(sentence); 
    202178        double confidence = bestLanguage.getConfidence(); 
    203         if(bestLanguage.getLang().equals(MAORI_3LETTER_CODE) 
    204            && confidence >= 0.1) { 
    205          
    206         /* 
    207         if(prev_confidence >= this.MINIMUM_CONFIDENCE) { 
    208             if(confidence < prev_confidence) { 
    209             // then the current sentence dragged down confidence 
    210             // and we're only confident about previous sentence 
    211             mriSentences.add(sentences[i-1]); 
    212             } else { 
    213              
    214             mriSentences.add(sentences[i]); 
    215             } 
    216         } 
    217         prev_confidence = confidence; 
    218         */ 
    219         System.err.println("Confidence for sentences up to " + i + ": " + confidence); 
    220         System.err.println(""); 
     179         
     180        if(bestLanguage.getLang().equals(langCode) && confidence >= confidenceCutoff) { 
     181        System.err.println("Adding sentence: " + sentence + "\n"); 
     182        mriSentences.add(sentence);      
    221183        } else { 
    222         System.err.println("NOT primary language - confidence: " + confidence); 
    223         } 
    224  
    225         /* 
    226         two_sentences = sentences[i] + " Pinky was here today."; 
    227         bestLanguage = myCategorizer.predictLanguage(two_sentences); 
    228         double confidence = bestLanguage.getConfidence(); 
    229         System.err.println("Confidence for added Pinky: " + confidence); 
    230         System.err.println(""); 
    231         */ 
     184        System.err.println("SKIPPING sentence: " + sentence + "\n"); 
     185        } 
    232186    } 
    233187    return mriSentences; 
     
    249203    // longer texts that have at least 2 sentences or more from the same language." 
    250204     
    251      
    252     // we're pretty confident that the following static string is in Māori 
    253     // but want to store its confidence level as baseline confidence value 
    254     // to compare other sentences against 
    255  
    256     String baseline = TWO_HIGH_CONFIDENCE_MRI_SENTENCES; 
    257     return getAllSentencesInLanguage(MAORI_3LETTER_CODE, baseline, text); 
    258     } 
    259  
    260      
    261     public ArrayList<String> getAllSentencesInLanguage(String langCode, String baseline, String text) throws Exception { 
    262     // we'll be storing just those sentences that are in the requested language code 
    263     ArrayList<String> mriSentences = new ArrayList<String>(); 
    264          
    265     Language bestLanguage = myCategorizer.predictLanguage(baseline); 
    266     if(!bestLanguage.getLang().equals(langCode)) { 
    267         System.err.println("**** WARNING: baseline string in "+langCode+" language not properly detected as "+langCode); 
    268     } 
    269     double baselineConfidence = bestLanguage.getConfidence(); 
    270     System.err.println("Baseline confidence: " + baselineConfidence); 
    271     System.err.println("----------------------------------------"); 
    272      
    273     String[] sentences = sentenceDetector.sentDetect(text); 
    274      
    275     for(int i = 0; i < sentences.length; i++) { 
    276         String unknownLangSentenceAppendedToBaseline = baseline+" "+sentences[i]; 
    277  
    278         System.err.println("Added sentence: " + sentences[i]); 
    279          
    280         bestLanguage = myCategorizer.predictLanguage(unknownLangSentenceAppendedToBaseline); 
    281         double confidence = bestLanguage.getConfidence(); 
    282         //System.err.println("Confidence is now " + confidence); 
    283  
    284         // confidence in text's detected language should increase 
    285         // with additional sentence in same language (or should stay about the same?) 
    286         // not decrease with additional sentence in same language 
    287         if(bestLanguage.getLang().equals(langCode) && confidence > baselineConfidence) { 
    288  
    289         System.err.println("Added sentence increased confidence to: " + confidence); 
    290         mriSentences.add(sentences[i]);      
    291         } 
    292         else { 
    293         System.err.println("ADDED sentence not in " + langCode + " as it DECREASED confidence to: " + confidence); 
    294         } 
    295         System.err.println(""); 
    296          
    297     } 
    298     return mriSentences; 
     205    // For evaluating single languages, I used a very small data set and found that 
     206    // if the primary language detected is MRI AND if the confidence is >= 0.1, the 
     207    // results appear reasonably to be in te reo Māori. 
     208     
     209    final double confidenceCutoff = 0.1; 
     210    return getAllSentencesInLanguage(MAORI_3LETTER_CODE, text, confidenceCutoff); 
    299211    } 
    300212 
     
    588500 
    589501        // TODO 
    590         maoriTextDetector.old_getAllSentencesInMaori(MRI_SENTENCE_TEST); 
    591         //maoriTextDetector.getAllSentencesInLanguage(MAORI_3LETTER_CODE, MRI_SENTENCE_TEST); 
    592         //maoriTextDetector.getAllSentencesInLanguage(MAORI_3LETTER_CODE, TWO_HIGH_CONFIDENCE_MRI_SENTENCES, 
    593         maoriTextDetector.old_getAllSentencesInMaori( 
     502        maoriTextDetector.getAllSentencesInMaori( 
    594503                            "Primary sources ~ Published Maramataka Mo Te Tau 1885, Nepia: Te Haaringi, Kai-ta Pukapuka, kei Hehitingi Tiriti, 1884. Maramataka Mo Te Tau 1886, Nepia: Na te Haaringi i ta ki tona Whare Perehi Pukapuka, 1885. Maramataka Mo Te Tau 1887, Nepia: Na te Haaringi i ta ki tona Whare Perehi Pukapuka, 1886. Maramataka Mo Te Tau 1888, Nepia: Na te Haaringi i ta ki tona Whare Perehi Pukapuka, 1887. Maramataka Mo Te Tau 1889, Nepia: Na te Haaringi i ta ki tona Whare Perehi Pukapuka, 1888. Maramataka Mo Te Tau 1890, Nepia: Na te Haaringi i ta ki tona Whare Perehi Pukapuka, 1889. Maramataka Mo Te Tau 1891, Kihipane: Na te Muri i ta ki tona Whare Perehi Pukapuka, 1890. Maramataka Mo Te Tau 1892, Nepia: Na te Haaringi, i ta ki tona Whare Perehi Pukapuka, 1891. Maramataka Mo Te Tau 1893, Kihipane: Na te Muri i ta ki tona Whare Perehi Pukapuka, 1892. Maramataka Mo Te Tau 1894, Kihipane: Na te Muri i ta ki tona Whare Perehi Pukapuka, 1893. Maramataka Me Te Tau 1895, Kihipane: Na te Muri i Ta ki tona whare perehi pukapuka, 1894. Maramataka Mo Te Tau 1896, Kihipane: Na te Muri i ta ki tona Whare Perehi Pukapuka, 1895. Maramataka Mo Te Tau 1897, Kihipane: Na te Muri i ta ki tona Whare Perehi Pukapuka 1896. Maramataka Mo Te Tau 1898, Turanga: Na te Wiremu Hapata i ta ki Te Rau Kahikatea, 1897. Ko Te Paipera Tapu Ara, Ko Te Kawenata Tawhito Me Te Kawenata Hou, He Mea Whakamaori Mai No Nga Reo I Oroko-Tuhituhia Ai, Ranana: He mea ta ki te perehi a W.M.Watts ma te Komiti Ta Paipera mo Ingarangi mo Te Ao Katoa, 1868. Ko Te Pukapuka O Nga Inoinga, Me Era Atu Tikanga, I Whakaritea E Te Hahi O Ingarani, Mo Te Minitatanga O Nga Hakarameta, O Era Atu Ritenga a Te Hahi: Me Nga Waiata Ano Hoki a Rawiri, Me Te Tikanga Mo Te Whiriwhiringa, Mo Te Whakaturanga, Me Te Whakatapunga O Nga Pihopa, O Nga Piriti, Me Nga Rikona, Me Nga Himene, Ranana: I taia tenei ki te perehi o te Komiti mo te whakapuaki i to mohiotanga ki a te Karaiti, 1858. 
Ko Te Pukapuka O Nga Inoinga, Me Era Atu Tikanga, I Whakaritea E Te Hahi O Ingarani, Mo Te Minitatanga O Nga Hakarameta, O Era Atu Ritenga a Te Hahi: Me Nga Waiata Ano Hoki a Rawiri, Me Te Tikanga Mo Te Whiriwhiringa, Mo Te Whakaturanga, Me Te Whakatapunga O Nga Pihopa, O Nga Piriti, Me Nga Rikona. 1883. The Book of Common Prayer, and Administration of the Sacraments, and Other Rites and Ceremonies of the Church, According to the Use of the United Church of England and Ireland: Together with the Proper Lessons for Sundays and Other Holy-Days, and a New Version of the Psalms of David, Oxford: Printed at 134 the University Press, 1852. The Book of Common Prayer and Administration of the Sacraments, and Other Rites and Ceremonies of the Church, According to the Church of England: Together with the Psalter or Psalms of David, Printed as They Are to Be Sung or Said in Churches: And the Form and Manner of Making, Ordaining, and Consecrating of Bishops, Priests, and Deacons, London: G.E. Eyre and W. Spottiswoode, after 1871 but before 1877. Brown, A.N., The Journals of A.N. Brown C.M.S. Missionary Tauranga Covering the Years 1840 to 1842, Tauranga: The Elms Trust, 1990 (Commemorative Edition). ______________, Select Sermons of A.N. Brown, Tauranga: The Elms Trust, 1997. Fitzgerald, Caroline (ed.), Te Wiremu Henry Williams: Early Years in the North, Wellington: Huia Publishers, 2011. The Hawke's Bay Almanac, Napier: James Wood, Hawke's Bay Herald, 1862, 1863, 1867."); 
    595504