Changeset 33584


Timestamp:
2019-10-18T21:20:39+13:00
Author:
ak19
Message:

Committing experimental version 2 using the sentence detector model, experimenting with how best to detect whether individual sentences are in Maori or not.

File:
1 edited

  • gs3-extensions/maori-lang-detection/src/org/greenstone/atea/MaoriTextDetector.java

    r33583 r33584  
    7373     * to be in MRI.
    7474     */
    75     private final String TWO_HIGH_CONFIDENCE_MRI_SENTENCES = "Hohoro tonu te whakaae me te haere katoa o nga hoia, kit e wahi e whakamatea nei e te Tarakona. E hou ana to ratou taenga atu, mataara tonu ratou, a kihai I mahue o ratou kahu arai; ka moe etahi ka ara etahi kit e whanga; ano to hunga e tu ana, rawe rawa I te kanapa o a ratou kahu arai, me o ratou ringaringa; hari tonu te tangata kainga no te mea kei waenga pu I a ratou nga hoia o te Kingi e whanga ana."; // 0.9497468988295584
     75    private final static String TWO_HIGH_CONFIDENCE_MRI_SENTENCES = "Hohoro tonu te whakaae me te haere katoa o nga hoia, kit e wahi e whakamatea nei e te Tarakona. E hou ana to ratou taenga atu, mataara tonu ratou, a kihai I mahue o ratou kahu arai; ka moe etahi ka ara etahi kit e whanga; ano to hunga e tu ana, rawe rawa I te kanapa o a ratou kahu arai, me o ratou ringaringa; hari tonu te tangata kainga no te mea kei waenga pu I a ratou nga hoia o te Kingi e whanga ana."; // 0.9497468988295584
    7676   
    7777    //"Nan, kia mataara koutou. Ki te mahara koutou ki aku kupu, a ka karanga mai ki taku ingoa, I nga wa, e tata mai ai te mate; a kit e mau tonu ano hoki koutou ki te kahu aria katoa, me nga ringaringa, kua oti nei te taka e ahau mo koutou, e kore koutou e mate I te Tarakona."; // 0.7220962333610585
     
    7979    //"E rapu ana ia i te hoari a tona Piriniha, a te kitea ki tana taha e mau ana, ka ngore noa nga turi, e haerea atu ana e te Tarakona, e karanga ana ia ki tona Kingi, otira e mea ake ana a roto i a ia, kua pahure ke te ra e karanga atu ai ia; kua whakarere hoki ia i ana kahu arai me ona ringaringa, a kahore he mea hei whakakora i a ia, tahuri noa atu ia ki te oma, hoake rawa kua kapi mai a mua ona, i nga tao o te Tarakona, a na te mea kua mahue i a ia nga takai paraihi, ngore noa ona waewae, kainga ana ia e te Tarakona. I peneitia ano hoki etahi, a ngaro noa iho ratou i te tirohanga a o ratou [6/7] hoa; ka puta te mahara ki o ratou hoa, ka pouri o ratou ngakau, otira kihai roa, kua hakari ano ratou, kua inu, kua hari, kua whakarere i o ratou kahu arai, kua wareware ano hoki ki nga kupu a to ratou Piriniha, te mahara kua patata te mate."; //0.991402350887951.
    8080
    81     /** http://www.greenstone.org/ */
    82     private final String TWO_HIGH_CONFIDENCE_EN_SENTENCES="We hope that this software will encourage the effective deployment of digital libraries to share information and place it in the public domain. Further information can be found in the book How to build a digital library, authored by three of the group's members.";
     81    /** English language sentences from http://www.greenstone.org/ */
     82    private final static String TWO_HIGH_CONFIDENCE_EN_SENTENCES="We hope that this software will encourage the effective deployment of digital libraries to share information and place it in the public domain. Further information can be found in the book How to build a digital library, authored by three of the group's members.";
    8383
    8484   
     
    173173    }
    174174
    175     public ArrayList<String> getAllSentencesInMaori(String text) throws Exception {
     175    public ArrayList<String> old_getAllSentencesInMaori(String text) throws Exception {
    176176    // big assumption here: that we can split incoming text into sentences
    177177    // for any language (using the Māori language trained sentence model),
     
    191191    double prev_confidence = 0.0;
    192192   
    193     for(int i = 1; i < sentences.length; i++) {
    194       String two_sentences = sentences[i-1]+" "+sentences[i]+" This is another sentence.";
    195       //for(int i = 0; i < sentences.length; i++) {
    196       //String two_sentences = sentences[i];
    197 
     193    //for(int i = 1; i < sentences.length; i++) {
     194    //String two_sentences = sentences[i-1]+" "+sentences[i]+" This is another sentence.";
     195    for(int i = 0; i < sentences.length; i++) {
     196        String two_sentences = sentences[i];
    198197   
    199198        System.err.println(two_sentences);     
     
    201200        //isTextInMaori(two_sentences)
    202201        Language bestLanguage = myCategorizer.predictLanguage(two_sentences);
    203         if(bestLanguage.getLang().equals(MAORI_3LETTER_CODE)) {
    204         double confidence = bestLanguage.getConfidence();
     202        double confidence = bestLanguage.getConfidence();
     203        if(bestLanguage.getLang().equals(MAORI_3LETTER_CODE)
     204           && confidence >= 0.1) {
     205       
    205206        /*
    206207        if(prev_confidence >= this.MINIMUM_CONFIDENCE) {
     
    218219        System.err.println("Confidence for sentences up to " + i + ": " + confidence);
    219220        System.err.println("");
     221        } else {
     222        System.err.println("NOT primary language - confidence: " + confidence);
    220223        }
    221224
     
    231234    }
    232235
    233     public ArrayList<String> getAllSentencesInLanguage(String langCode, String text) throws Exception {
     236   
     237    public ArrayList<String> getAllSentencesInMaori(String text) throws Exception {
    234238    // big assumption here: that we can split incoming text into sentences
    235239    // for any language (using the Māori language trained sentence model),
     
    239243
    240244    // we'll be storing just those sentences in text that are in Māori.
    241     ArrayList<String> mriSentences = new ArrayList<String>();
     245   
    242246    // OpenNLP language detection works best with a minimum of 2 sentences
    243247    // See https://opennlp.apache.org/news/model-langdetect-183.html
     
    251255
    252256    String baseline = TWO_HIGH_CONFIDENCE_MRI_SENTENCES;
    253    
     257    return getAllSentencesInLanguage(MAORI_3LETTER_CODE, baseline, text);
     258    }
     259
     260   
     261    public ArrayList<String> getAllSentencesInLanguage(String langCode, String baseline, String text) throws Exception {
     262    // we'll be storing just those sentences that are in the requested language code
     263    ArrayList<String> mriSentences = new ArrayList<String>();
     264       
    254265    Language bestLanguage = myCategorizer.predictLanguage(baseline);
    255266    if(!bestLanguage.getLang().equals(langCode)) {
    256         System.err.println("@@@@ Something's gone wrong, obvious "+MAORI_3LETTER_CODE+" language string not properly detected as "+MAORI_3LETTER_CODE+" any more.");
     267        System.err.println("**** WARNING: baseline string in "+langCode+" language not properly detected as "+langCode);
    257268    }
    258269    double baselineConfidence = bestLanguage.getConfidence();
     
    271282        //System.err.println("Confidence is now " + confidence);
    272283
    273         //if(!bestLanguage.getLang().equals(langCode) || confidence < this.MINIMUM_CONFIDENCE) {
    274        
    275         // confidence should increase with added sentence in same language (or should
    276         // stay about the same?) not decrease with added sentence in same language
     284        // confidence in text's detected language should increase
     285        // with additional sentence in same language (or should stay about the same?)
     286        // not decrease with additional sentence in same language
    277287        if(bestLanguage.getLang().equals(langCode) && confidence > baselineConfidence) {
    278288
    279         System.err.println("Added sentence (maintained or) increased confidence to: " + confidence);
    280        
    281        
     289        System.err.println("Added sentence increased confidence to: " + confidence);
     290        mriSentences.add(sentences[i]);     
    282291        }
    283292        else {
     
    579588
    580589        // TODO
    581         maoriTextDetector.getAllSentencesInMaori(MRI_SENTENCE_TEST);
     590        maoriTextDetector.old_getAllSentencesInMaori(MRI_SENTENCE_TEST);
    582591        //maoriTextDetector.getAllSentencesInLanguage(MAORI_3LETTER_CODE, MRI_SENTENCE_TEST);
    583         maoriTextDetector.getAllSentencesInLanguage(MAORI_3LETTER_CODE,
     592        //maoriTextDetector.getAllSentencesInLanguage(MAORI_3LETTER_CODE, TWO_HIGH_CONFIDENCE_MRI_SENTENCES,
     593        maoriTextDetector.old_getAllSentencesInMaori(
    584594                            "Primary sources ~ Published Maramataka Mo Te Tau 1885, Nepia: Te Haaringi, Kai-ta Pukapuka, kei Hehitingi Tiriti, 1884. Maramataka Mo Te Tau 1886, Nepia: Na te Haaringi i ta ki tona Whare Perehi Pukapuka, 1885. Maramataka Mo Te Tau 1887, Nepia: Na te Haaringi i ta ki tona Whare Perehi Pukapuka, 1886. Maramataka Mo Te Tau 1888, Nepia: Na te Haaringi i ta ki tona Whare Perehi Pukapuka, 1887. Maramataka Mo Te Tau 1889, Nepia: Na te Haaringi i ta ki tona Whare Perehi Pukapuka, 1888. Maramataka Mo Te Tau 1890, Nepia: Na te Haaringi i ta ki tona Whare Perehi Pukapuka, 1889. Maramataka Mo Te Tau 1891, Kihipane: Na te Muri i ta ki tona Whare Perehi Pukapuka, 1890. Maramataka Mo Te Tau 1892, Nepia: Na te Haaringi, i ta ki tona Whare Perehi Pukapuka, 1891. Maramataka Mo Te Tau 1893, Kihipane: Na te Muri i ta ki tona Whare Perehi Pukapuka, 1892. Maramataka Mo Te Tau 1894, Kihipane: Na te Muri i ta ki tona Whare Perehi Pukapuka, 1893. Maramataka Me Te Tau 1895, Kihipane: Na te Muri i Ta ki tona whare perehi pukapuka, 1894. Maramataka Mo Te Tau 1896, Kihipane: Na te Muri i ta ki tona Whare Perehi Pukapuka, 1895. Maramataka Mo Te Tau 1897, Kihipane: Na te Muri i ta ki tona Whare Perehi Pukapuka 1896. Maramataka Mo Te Tau 1898, Turanga: Na te Wiremu Hapata i ta ki Te Rau Kahikatea, 1897. Ko Te Paipera Tapu Ara, Ko Te Kawenata Tawhito Me Te Kawenata Hou, He Mea Whakamaori Mai No Nga Reo I Oroko-Tuhituhia Ai, Ranana: He mea ta ki te perehi a W.M.Watts ma te Komiti Ta Paipera mo Ingarangi mo Te Ao Katoa, 1868. Ko Te Pukapuka O Nga Inoinga, Me Era Atu Tikanga, I Whakaritea E Te Hahi O Ingarani, Mo Te Minitatanga O Nga Hakarameta, O Era Atu Ritenga a Te Hahi: Me Nga Waiata Ano Hoki a Rawiri, Me Te Tikanga Mo Te Whiriwhiringa, Mo Te Whakaturanga, Me Te Whakatapunga O Nga Pihopa, O Nga Piriti, Me Nga Rikona, Me Nga Himene, Ranana: I taia tenei ki te perehi o te Komiti mo te whakapuaki i to mohiotanga ki a te Karaiti, 1858. 
Ko Te Pukapuka O Nga Inoinga, Me Era Atu Tikanga, I Whakaritea E Te Hahi O Ingarani, Mo Te Minitatanga O Nga Hakarameta, O Era Atu Ritenga a Te Hahi: Me Nga Waiata Ano Hoki a Rawiri, Me Te Tikanga Mo Te Whiriwhiringa, Mo Te Whakaturanga, Me Te Whakatapunga O Nga Pihopa, O Nga Piriti, Me Nga Rikona. 1883. The Book of Common Prayer, and Administration of the Sacraments, and Other Rites and Ceremonies of the Church, According to the Use of the United Church of England and Ireland: Together with the Proper Lessons for Sundays and Other Holy-Days, and a New Version of the Psalms of David, Oxford: Printed at 134 the University Press, 1852. The Book of Common Prayer and Administration of the Sacraments, and Other Rites and Ceremonies of the Church, According to the Church of England: Together with the Psalter or Psalms of David, Printed as They Are to Be Sung or Said in Churches: And the Form and Manner of Making, Ordaining, and Consecrating of Bishops, Priests, and Deacons, London: G.E. Eyre and W. Spottiswoode, after 1871 but before 1877. Brown, A.N., The Journals of A.N. Brown C.M.S. Missionary Tauranga Covering the Years 1840 to 1842, Tauranga: The Elms Trust, 1990 (Commemorative Edition). ______________, Select Sermons of A.N. Brown, Tauranga: The Elms Trust, 1997. Fitzgerald, Caroline (ed.), Te Wiremu Henry Williams: Early Years in the North, Wellington: Huia Publishers, 2011. The Hawke's Bay Almanac, Napier: James Wood, Hawke's Bay Herald, 1862, 1863, 1867.");
    585595
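
The two filtering strategies this changeset experiments with can be sketched in isolation. The following is a minimal sketch, not the code in MaoriTextDetector.java: `Prediction`, `Detector`, `SentenceFilterSketch` and `toyDetector()` are hypothetical stand-ins introduced here so the example runs without OpenNLP (in the changeset, prediction is done by `myCategorizer.predictLanguage()`). The 0.1 threshold and the `confidence > baselineConfidence` test come from the diff; how each sentence is combined with the baseline is not visible in the diff, so appending it after a space is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for a detector's best guess (not the OpenNLP Language class). */
final class Prediction {
    final String lang;
    final double confidence;

    Prediction(String lang, double confidence) {
        this.lang = lang;
        this.confidence = confidence;
    }
}

/** Hypothetical detector interface playing the role of myCategorizer.predictLanguage(). */
interface Detector {
    Prediction predict(String text);
}

final class SentenceFilterSketch {

    /** Strategy 1: classify each sentence alone; keep it if the language matches
     *  and the confidence reaches a minimum (the changeset uses 0.1). */
    static List<String> filterByThreshold(Detector detector, String langCode,
                                          double minConfidence, String[] sentences) {
        List<String> kept = new ArrayList<>();
        for (String sentence : sentences) {
            Prediction p = detector.predict(sentence);
            if (p.lang.equals(langCode) && p.confidence >= minConfidence) {
                kept.add(sentence);
            }
        }
        return kept;
    }

    /** Strategy 2: append each sentence to a known-good baseline text; keep it only if
     *  the combined text is still in langCode and confidence rose above the baseline's.
     *  ASSUMPTION: the combination is baseline + " " + sentence. */
    static List<String> filterByBaseline(Detector detector, String langCode,
                                         String baseline, String[] sentences) {
        List<String> kept = new ArrayList<>();
        double baselineConfidence = detector.predict(baseline).confidence;
        for (String sentence : sentences) {
            Prediction p = detector.predict(baseline + " " + sentence);
            if (p.lang.equals(langCode) && p.confidence > baselineConfidence) {
                kept.add(sentence);
            }
        }
        return kept;
    }

    /** Toy detector for demonstration only: scores text by the fraction of words
     *  found in a tiny hard-coded word list. Not a real language model. */
    static Detector toyDetector() {
        Set<String> mriWords = new HashSet<>(Arrays.asList("te", "ko", "nga", "ki", "me"));
        return text -> {
            String[] words = text.trim().toLowerCase().split("\\s+");
            int hits = 0;
            for (String w : words) {
                if (mriWords.contains(w)) {
                    hits++;
                }
            }
            double fraction = (double) hits / words.length;
            return fraction >= 0.5
                    ? new Prediction("mri", fraction)
                    : new Prediction("eng", 1.0 - fraction);
        };
    }
}
```

With a real model, the baseline strategy depends on the baseline text not already scoring near 1.0, since a sentence is kept only when it raises confidence strictly above the baseline value.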