Changeset 33584

Timestamp: 18.10.2019 21:20:39
Author: ak19
Message:

Committing experimental version 2, which uses the sentence detector model to experiment with how best to detect whether individual sentences are in Maori.
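
The per-sentence check that this changeset experiments with can be illustrated with a minimal, self-contained sketch. It is not the code from MaoriTextDetector.java: the class SentenceFilterSketch, the Prediction holder, and the word-counting toyPredict detector are all hypothetical stand-ins for the OpenNLP language detector, and the marker-word list is an illustrative assumption. Only the acceptance rule (predicted language matches and confidence is at least a small floor such as 0.1) mirrors the diff.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;

/** Sketch: keep only sentences whose best-guess language matches, above a confidence floor. */
class SentenceFilterSketch {

    /** Hypothetical stand-in for a detector result: language code plus confidence. */
    static final class Prediction {
        final String lang;
        final double confidence;

        Prediction(String lang, double confidence) {
            this.lang = lang;
            this.confidence = confidence;
        }
    }

    // Toy vocabulary, an assumption for illustration only (not a trained model).
    private static final Set<String> MRI_MARKERS = new HashSet<>(
            Arrays.asList("te", "ki", "nga", "ko", "e", "o", "i", "ka", "me"));

    /** Toy detector: scores text by the fraction of its words found in MRI_MARKERS. */
    static Prediction toyPredict(String text) {
        int words = 0;
        int hits = 0;
        for (String w : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (w.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            words++;
            if (MRI_MARKERS.contains(w)) {
                hits++;
            }
        }
        double fraction = (words == 0) ? 0.0 : (double) hits / words;
        return (fraction > 0.25)
                ? new Prediction("mri", fraction)
                : new Prediction("eng", 1.0 - fraction);
    }

    /** Returns the sentences predicted as langCode with confidence >= minConfidence. */
    static List<String> sentencesInLanguage(String[] sentences, String langCode,
            double minConfidence, Function<String, Prediction> detector) {
        List<String> kept = new ArrayList<>();
        for (String sentence : sentences) {
            Prediction best = detector.apply(sentence);
            if (best.lang.equals(langCode) && best.confidence >= minConfidence) {
                kept.add(sentence);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String[] sentences = {"Ko te reo o nga tangata.", "This is another sentence."};
        System.out.println(
                sentencesInLanguage(sentences, "mri", 0.1, SentenceFilterSketch::toyPredict));
    }
}
```

Passing the detector in as a function keeps the filtering rule separate from the model, so a real language detector could be substituted without changing the loop.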

Files:
1 modified

  • gs3-extensions/maori-lang-detection/src/org/greenstone/atea/MaoriTextDetector.java

    r33583 r33584  
    7373     * to be in MRI. 
    7474     */ 
    75     private final String TWO_HIGH_CONFIDENCE_MRI_SENTENCES = "Hohoro tonu te whakaae me te haere katoa o nga hoia, kit e wahi e whakamatea nei e te Tarakona. E hou ana to ratou taenga atu, mataara tonu ratou, a kihai I mahue o ratou kahu arai; ka moe etahi ka ara etahi kit e whanga; ano to hunga e tu ana, rawe rawa I te kanapa o a ratou kahu arai, me o ratou ringaringa; hari tonu te tangata kainga no te mea kei waenga pu I a ratou nga hoia o te Kingi e whanga ana."; // 0.9497468988295584 
     75    private final static String TWO_HIGH_CONFIDENCE_MRI_SENTENCES = "Hohoro tonu te whakaae me te haere katoa o nga hoia, kit e wahi e whakamatea nei e te Tarakona. E hou ana to ratou taenga atu, mataara tonu ratou, a kihai I mahue o ratou kahu arai; ka moe etahi ka ara etahi kit e whanga; ano to hunga e tu ana, rawe rawa I te kanapa o a ratou kahu arai, me o ratou ringaringa; hari tonu te tangata kainga no te mea kei waenga pu I a ratou nga hoia o te Kingi e whanga ana."; // 0.9497468988295584 
    7676     
    7777    //"Nan, kia mataara koutou. Ki te mahara koutou ki aku kupu, a ka karanga mai ki taku ingoa, I nga wa, e tata mai ai te mate; a kit e mau tonu ano hoki koutou ki te kahu aria katoa, me nga ringaringa, kua oti nei te taka e ahau mo koutou, e kore koutou e mate I te Tarakona."; // 0.7220962333610585 
     
    7979    //"E rapu ana ia i te hoari a tona Piriniha, a te kitea ki tana taha e mau ana, ka ngore noa nga turi, e haerea atu ana e te Tarakona, e karanga ana ia ki tona Kingi, otira e mea ake ana a roto i a ia, kua pahure ke te ra e karanga atu ai ia; kua whakarere hoki ia i ana kahu arai me ona ringaringa, a kahore he mea hei whakakora i a ia, tahuri noa atu ia ki te oma, hoake rawa kua kapi mai a mua ona, i nga tao o te Tarakona, a na te mea kua mahue i a ia nga takai paraihi, ngore noa ona waewae, kainga ana ia e te Tarakona. I peneitia ano hoki etahi, a ngaro noa iho ratou i te tirohanga a o ratou [6/7] hoa; ka puta te mahara ki o ratou hoa, ka pouri o ratou ngakau, otira kihai roa, kua hakari ano ratou, kua inu, kua hari, kua whakarere i o ratou kahu arai, kua wareware ano hoki ki nga kupu a to ratou Piriniha, te mahara kua patata te mate."; //0.991402350887951. 
    8080 
    81     /** http://www.greenstone.org/ */ 
    82     private final String TWO_HIGH_CONFIDENCE_EN_SENTENCES="We hope that this software will encourage the effective deployment of digital libraries to share information and place it in the public domain. Further information can be found in the book How to build a digital library, authored by three of the group's members."; 
     81    /** English language sentences from http://www.greenstone.org/ */ 
     82    private final static String TWO_HIGH_CONFIDENCE_EN_SENTENCES="We hope that this software will encourage the effective deployment of digital libraries to share information and place it in the public domain. Further information can be found in the book How to build a digital library, authored by three of the group's members."; 
    8383 
    8484     
     
    173173    } 
    174174 
    175     public ArrayList<String> getAllSentencesInMaori(String text) throws Exception { 
     175    public ArrayList<String> old_getAllSentencesInMaori(String text) throws Exception { 
    176176    // big assumption here: that we can split incoming text into sentences 
    177177    // for any language (using the Māori language trained sentence model), 
     
    191191    double prev_confidence = 0.0; 
    192192     
    193     for(int i = 1; i < sentences.length; i++) { 
    194       String two_sentences = sentences[i-1]+" "+sentences[i]+" This is another sentence."; 
    195       //for(int i = 0; i < sentences.length; i++) { 
    196       //String two_sentences = sentences[i]; 
    197  
     193    //for(int i = 1; i < sentences.length; i++) { 
     194    //String two_sentences = sentences[i-1]+" "+sentences[i]+" This is another sentence."; 
     195    for(int i = 0; i < sentences.length; i++) { 
     196        String two_sentences = sentences[i]; 
    198197     
    199198        System.err.println(two_sentences);       
     
    201200        //isTextInMaori(two_sentences) 
    202201        Language bestLanguage = myCategorizer.predictLanguage(two_sentences); 
    203         if(bestLanguage.getLang().equals(MAORI_3LETTER_CODE)) { 
    204         double confidence = bestLanguage.getConfidence(); 
     202        double confidence = bestLanguage.getConfidence(); 
     203        if(bestLanguage.getLang().equals(MAORI_3LETTER_CODE) 
     204           && confidence >= 0.1) { 
     205         
    205206        /* 
    206207        if(prev_confidence >= this.MINIMUM_CONFIDENCE) { 
     
    218219        System.err.println("Confidence for sentences up to " + i + ": " + confidence); 
    219220        System.err.println(""); 
     221        } else { 
     222        System.err.println("NOT primary language - confidence: " + confidence); 
    220223        } 
    221224 
     
    231234    } 
    232235 
    233     public ArrayList<String> getAllSentencesInLanguage(String langCode, String text) throws Exception { 
     236     
     237    public ArrayList<String> getAllSentencesInMaori(String text) throws Exception { 
    234238    // big assumption here: that we can split incoming text into sentences 
    235239    // for any language (using the Māori language trained sentence model), 
     
    239243 
    240244    // we'll be storing just those sentences in text that are in Māori.  
    241     ArrayList<String> mriSentences = new ArrayList<String>(); 
     245     
    242246    // OpenNLP language detection works best with a minimum of 2 sentences 
    243247    // See https://opennlp.apache.org/news/model-langdetect-183.html 
     
    251255 
    252256    String baseline = TWO_HIGH_CONFIDENCE_MRI_SENTENCES; 
    253      
     257    return getAllSentencesInLanguage(MAORI_3LETTER_CODE, baseline, text); 
     258    } 
     259 
     260     
     261    public ArrayList<String> getAllSentencesInLanguage(String langCode, String baseline, String text) throws Exception { 
     262    // we'll be storing just those sentences that are in the requested language code 
     263    ArrayList<String> mriSentences = new ArrayList<String>(); 
     264         
    254265    Language bestLanguage = myCategorizer.predictLanguage(baseline); 
    255266    if(!bestLanguage.getLang().equals(langCode)) { 
    256         System.err.println("@@@@ Something's gone wrong, obvious "+MAORI_3LETTER_CODE+" language string not properly detected as "+MAORI_3LETTER_CODE+" any more."); 
     267        System.err.println("**** WARNING: baseline string in "+langCode+" language not properly detected as "+langCode); 
    257268    } 
    258269    double baselineConfidence = bestLanguage.getConfidence(); 
     
    271282        //System.err.println("Confidence is now " + confidence); 
    272283 
    273         //if(!bestLanguage.getLang().equals(langCode) || confidence < this.MINIMUM_CONFIDENCE) { 
    274          
    275         // confidence should increase with added sentence in same language (or should 
    276         // stay about the same?) not decrease with added sentence in same language 
     284        // confidence in text's detected language should increase 
     285        // with additional sentence in same language (or should stay about the same?) 
     286        // not decrease with additional sentence in same language 
    277287        if(bestLanguage.getLang().equals(langCode) && confidence > baselineConfidence) { 
    278288 
    279         System.err.println("Added sentence (maintained or) increased confidence to: " + confidence); 
    280          
    281          
     289        System.err.println("Added sentence increased confidence to: " + confidence); 
     290        mriSentences.add(sentences[i]);      
    282291        } 
    283292        else { 
     
    579588 
    580589        // TODO 
    581         maoriTextDetector.getAllSentencesInMaori(MRI_SENTENCE_TEST); 
     590        maoriTextDetector.old_getAllSentencesInMaori(MRI_SENTENCE_TEST); 
    582591        //maoriTextDetector.getAllSentencesInLanguage(MAORI_3LETTER_CODE, MRI_SENTENCE_TEST); 
    583         maoriTextDetector.getAllSentencesInLanguage(MAORI_3LETTER_CODE,  
     592        //maoriTextDetector.getAllSentencesInLanguage(MAORI_3LETTER_CODE, TWO_HIGH_CONFIDENCE_MRI_SENTENCES, 
     593        maoriTextDetector.old_getAllSentencesInMaori( 
    584594                            "Primary sources ~ Published Maramataka Mo Te Tau 1885, Nepia: Te Haaringi, Kai-ta Pukapuka, kei Hehitingi Tiriti, 1884. Maramataka Mo Te Tau 1886, Nepia: Na te Haaringi i ta ki tona Whare Perehi Pukapuka, 1885. Maramataka Mo Te Tau 1887, Nepia: Na te Haaringi i ta ki tona Whare Perehi Pukapuka, 1886. Maramataka Mo Te Tau 1888, Nepia: Na te Haaringi i ta ki tona Whare Perehi Pukapuka, 1887. Maramataka Mo Te Tau 1889, Nepia: Na te Haaringi i ta ki tona Whare Perehi Pukapuka, 1888. Maramataka Mo Te Tau 1890, Nepia: Na te Haaringi i ta ki tona Whare Perehi Pukapuka, 1889. Maramataka Mo Te Tau 1891, Kihipane: Na te Muri i ta ki tona Whare Perehi Pukapuka, 1890. Maramataka Mo Te Tau 1892, Nepia: Na te Haaringi, i ta ki tona Whare Perehi Pukapuka, 1891. Maramataka Mo Te Tau 1893, Kihipane: Na te Muri i ta ki tona Whare Perehi Pukapuka, 1892. Maramataka Mo Te Tau 1894, Kihipane: Na te Muri i ta ki tona Whare Perehi Pukapuka, 1893. Maramataka Me Te Tau 1895, Kihipane: Na te Muri i Ta ki tona whare perehi pukapuka, 1894. Maramataka Mo Te Tau 1896, Kihipane: Na te Muri i ta ki tona Whare Perehi Pukapuka, 1895. Maramataka Mo Te Tau 1897, Kihipane: Na te Muri i ta ki tona Whare Perehi Pukapuka 1896. Maramataka Mo Te Tau 1898, Turanga: Na te Wiremu Hapata i ta ki Te Rau Kahikatea, 1897. Ko Te Paipera Tapu Ara, Ko Te Kawenata Tawhito Me Te Kawenata Hou, He Mea Whakamaori Mai No Nga Reo I Oroko-Tuhituhia Ai, Ranana: He mea ta ki te perehi a W.M.Watts ma te Komiti Ta Paipera mo Ingarangi mo Te Ao Katoa, 1868. Ko Te Pukapuka O Nga Inoinga, Me Era Atu Tikanga, I Whakaritea E Te Hahi O Ingarani, Mo Te Minitatanga O Nga Hakarameta, O Era Atu Ritenga a Te Hahi: Me Nga Waiata Ano Hoki a Rawiri, Me Te Tikanga Mo Te Whiriwhiringa, Mo Te Whakaturanga, Me Te Whakatapunga O Nga Pihopa, O Nga Piriti, Me Nga Rikona, Me Nga Himene, Ranana: I taia tenei ki te perehi o te Komiti mo te whakapuaki i to mohiotanga ki a te Karaiti, 1858. 
Ko Te Pukapuka O Nga Inoinga, Me Era Atu Tikanga, I Whakaritea E Te Hahi O Ingarani, Mo Te Minitatanga O Nga Hakarameta, O Era Atu Ritenga a Te Hahi: Me Nga Waiata Ano Hoki a Rawiri, Me Te Tikanga Mo Te Whiriwhiringa, Mo Te Whakaturanga, Me Te Whakatapunga O Nga Pihopa, O Nga Piriti, Me Nga Rikona. 1883. The Book of Common Prayer, and Administration of the Sacraments, and Other Rites and Ceremonies of the Church, According to the Use of the United Church of England and Ireland: Together with the Proper Lessons for Sundays and Other Holy-Days, and a New Version of the Psalms of David, Oxford: Printed at 134 the University Press, 1852. The Book of Common Prayer and Administration of the Sacraments, and Other Rites and Ceremonies of the Church, According to the Church of England: Together with the Psalter or Psalms of David, Printed as They Are to Be Sung or Said in Churches: And the Form and Manner of Making, Ordaining, and Consecrating of Bishops, Priests, and Deacons, London: G.E. Eyre and W. Spottiswoode, after 1871 but before 1877. Brown, A.N., The Journals of A.N. Brown C.M.S. Missionary Tauranga Covering the Years 1840 to 1842, Tauranga: The Elms Trust, 1990 (Commemorative Edition). ______________, Select Sermons of A.N. Brown, Tauranga: The Elms Trust, 1997. Fitzgerald, Caroline (ed.), Te Wiremu Henry Williams: Early Years in the North, Wellington: Huia Publishers, 2011. The Hawke's Bay Almanac, Napier: James Wood, Hawke's Bay Herald, 1862, 1863, 1867."); 
    585595
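
The baseline idea behind the new getAllSentencesInLanguage(langCode, baseline, text) signature can be sketched as follows. This is a hedged illustration, not the committed code: part of the method body is elided in the diff, so appending each candidate sentence to the baseline before re-scoring is an assumption, and BaselineConfidenceSketch, Prediction, and toyPredict are hypothetical names standing in for the OpenNLP categorizer. The acceptance rule (same language and confidence strictly greater than the baseline's) follows the condition visible in the diff.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;

/** Sketch: keep sentences that raise detection confidence when added to a known baseline. */
class BaselineConfidenceSketch {

    /** Hypothetical stand-in for a detector result: language code plus confidence. */
    static final class Prediction {
        final String lang;
        final double confidence;

        Prediction(String lang, double confidence) {
            this.lang = lang;
            this.confidence = confidence;
        }
    }

    // Toy vocabulary, an assumption for illustration only (not a trained model).
    private static final Set<String> MRI_MARKERS = new HashSet<>(
            Arrays.asList("te", "ki", "nga", "ko", "e", "o", "i", "ka", "me"));

    /** Toy detector: scores text by the fraction of its words found in MRI_MARKERS. */
    static Prediction toyPredict(String text) {
        int words = 0;
        int hits = 0;
        for (String w : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (w.isEmpty()) {
                continue;
            }
            words++;
            if (MRI_MARKERS.contains(w)) {
                hits++;
            }
        }
        double fraction = (words == 0) ? 0.0 : (double) hits / words;
        return (fraction > 0.25)
                ? new Prediction("mri", fraction)
                : new Prediction("eng", 1.0 - fraction);
    }

    /** Keeps each sentence for which baseline + sentence scores above the baseline alone. */
    static List<String> sentencesRaisingConfidence(String langCode, String baseline,
            String[] sentences, Function<String, Prediction> detector) {
        List<String> kept = new ArrayList<>();
        Prediction base = detector.apply(baseline);
        if (!base.lang.equals(langCode)) {
            System.err.println("**** WARNING: baseline string not detected as " + langCode);
        }
        for (String sentence : sentences) {
            // Assumption: the candidate is appended to the baseline before re-scoring.
            Prediction combined = detector.apply(baseline + " " + sentence);
            if (combined.lang.equals(langCode) && combined.confidence > base.confidence) {
                kept.add(sentence);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String baseline = "Ko te reo tenei.";
        String[] sentences = {"Ka haere nga hoia ki te pa o te iwi.", "This is another sentence."};
        System.out.println(sentencesRaisingConfidence(
                "mri", baseline, sentences, BaselineConfidenceSketch::toyPredict));
    }
}
```

Because the test is a strict "greater than", a sentence in the right language that merely holds confidence steady is rejected, which matches the removal of "(maintained or)" from the log message in this revision.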