Changeset 33854

Show
Ignore:
Timestamp:
21.01.2020 22:01:07 (4 weeks ago)
Author:
ak19
Message:

Manually gone over around 150 webpages of sample size of 255 webpages from NZ checking whether those for which isMRI=true was detected is indeed the case. Also have been sampling an almost equal number of NZ webpages for which isMRI=false yet containsMRI=true.

Files:
1 modified

Legend:

Unmodified
Added
Removed
  • other-projects/maori-lang-detection/mongodb-data/6table_nonProductSites1_manualShortlist.json

    r33848 r33854  
    2929 
    3030"_id","siteCount","numPagesInMRICount","numPagesContainingMRICount","URLs of pages detected as inMRI" 
    31 "nz","176.0","4360","9641" 
     31"nz","176.0" containsMRI vs 96 pages inMRI,"4360","9641" in 176 containsMRI pages vs 7968 in isMRI pages 
    3232"us","29.0", 
    3333    1+2+0+0+4+166+0+39 +257+2+21+12+25+13+53+0+1+0+1+11 +32+37+4 +0+0+0 = 681, 
     
    4646"ie","1.0","1","3", https://coggle.it/diagram/WSYB0mLA2QABD5BH/t/ko-au-ko-koe 
    4747 
     48 
     49 
     50 
     51 
     52-------------- 
     53 
     54https://www.statisticshowto.datasciencecentral.com/probability-and-statistics/find-sample-size/#CI1 
     55https://stats.stackexchange.com/questions/207584/sample-size-choice-with-binary-outcome 
     56https://www.statisticshowto.datasciencecentral.com/z-alpha2-za2/ 
     57 
     58N (NZ pages where isMRI comes out true) = 4360 
     59solving for n, the sample size 
     60confidence level = 90% 
     61m, margin of error = 5% 
     62 
     63From the "z alpha/2" table, for 90% confidence, we get a z alpha/2 value of 1.6449 (or 1.645). 
     64 
     65Then the sample size, n, we need is = 1.6449^2 * 4360 / ( 1.6449^2 + (4 * 4359) * 0.05^2) = 255 (rounded up) 
     66 
     67 
     68For N = 681,  
     69sample size n is = 1.6449^2 * 681 / ( 1.6449^2 + (4 * 680) * 0.05^2) = 194 (rounded up) 
     70 
     71 
     72sample size for NZ: 255 (90% confidence with 5% margine of error, Including a finite correction factor) 
     73sample size for US: 194 
     74 
    4875*/ 
    4976 
     
    6794 
    6895 
     96 
     97NZ - sample 255 pages from: 
     98/* 
     99db.Websites.aggregate([ 
     100    { 
     101        $match: { 
     102            $and: [ 
     103                {numPagesContainingMRI: {$gt: 0}}, 
     104                {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]} 
     105            ] 
     106        } 
     107    }, 
     108    { $unwind: "$geoLocationCountryCode" }, 
     109    { 
     110        $group: { 
     111            _id: "nz", 
     112            count: { $sum: 1 }, 
     113            domain: { $addToSet: '$domain' }, 
     114            numPagesInMRICount: { $sum: '$numPagesInMRI' }, 
     115            numPagesContainingMRICount: { $sum: '$numPagesContainingMRI' } 
     116        } 
     117    }, 
     118    { $sort : { count : -1} } 
     119]); 
     120 
     121 
     122OR is this better: 
     123 
     124db.Websites.aggregate([ 
     125    { 
     126        $match: { 
     127            $and: [ 
     128                {numPagesInMRI: {$gt: 0}}, 
     129                {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]} 
     130            ] 
     131        } 
     132    }, 
     133    { $unwind: "$geoLocationCountryCode" }, 
     134    { 
     135        $group: { 
     136            _id: "nz", 
     137            count: { $sum: 1 }, 
     138            domain: { $addToSet: '$domain' }, 
     139            numPagesInMRICount: { $sum: '$numPagesInMRI' }, 
     140            numPagesContainingMRICount: { $sum: '$numPagesContainingMRI' } 
     141        } 
     142    }, 
     143    { $sort : { count : -1} } 
     144]); 
     145*/ 
     146 
     147num NZ sites with > 0 isMRI pages = 96 
     148Total numPagesInMRI in NZ sites = 4360 
     149Total numPagesContainingMRI in NZ sites = 7968 
     150 
     151Using the results you get a list of domains that matched. 171 nz domains, though it should be 176? -1 
     152 
     153Copy each domain (up to 255 of them) and look for the first 1 or 2 max that matches isMRI: 
     154 
     1551. db.getCollection('Webpages').find({URL:/pukekohe.directorybusiness.co.nz/, isMRI: true}) - check it contains a positive number of pages in MRI and check the first 1-2 pages to make sure they are indeed in MRI. Note down the ratio of MRI finds. e.g. 2/2. 
     156 
     1572. Find those pages that containsMRI but not isMRI and check if there are indeed sentences in MRI. Note down the ratio for the first 2 pages. 
     158db.getCollection('Webpages').find({URL:/maori.livingheritage.org.nz/, isMRI: false, containsMRI: true}) 
     159 
     160 
     161 
     162/* 1 */ 
     163{ 
     164    "_id" : "nz", 
     165    "count" : 96.0, 
     166    "domain" : [  
     167        "http://www.teipukarea.maori.nz", 3/3 1/3 
     168        "http://ngatipahauwera.co.nz", 2/2, 2/2 
     169        "http://www.oag.govt.nz", 2/2 0/2 
     170        "https://sexualviolence.victimsinfo.govt.nz", 3/3 0/3 
     171        "http://tmoa.tki.org.nz", 3/3 3/3 
     172        "http://www.tewhanake.maori.nz", 3/3 2/3 
     173        "http://www.matarikifestival.org.nz", 4/4 0/3 
     174        "http://www.otepoti.school.nz", 3/3 0/4 
     175!!        "https://www.maoritelevision.com", 3/4, 0 [no containsMRI outside isMRI pages] 
     176        "http://pukapuka.nz", 3/3 1/4 [lorem ipsum used on first 3 pages] 
     177        "http://community.nzdl.org", 3/3 0/3 [containsMRI has detected Te Taka Keegan as MRI sentence] 
     178!!        "http://kmpmusic.co.nz", 0-4/4? [but CD listing of some MRI song titles] 0 [no other pages containsMRI] 
     179        "http://maori.livingheritage.org.nz", 2/2 2/2 
     180        "http://pukoro.co.nz", 2/2 0/2 
     181        "https://register.tpota.org.nz", 0/1 [form] 0/2 
     182X        "https://cdn.tehiku.nz" => DOMAIN: "tehiku.nz",  0/4, 1/3 [but audio content may be in MRI] 
     183!!        "http://www.runanga.co.nz", 3/3 0 [no containsMRI outside isMRI pages] 
     184!        "http://kuraaiwi.maori.nz", 2/4 [navigation only downloaded. But site content checked] 2/3 
     185        "http://kurataiao.tki.org.nz", 3/3, 1/total 3 
     186 
     187!!        "http://satellites.co.nz", 3/3 [kpop], 0 [no containsMRI outside isMRI pages] 
     188        "http://teaohou.natlib.govt.nz", 4/4, 2/4 
     189        "http://www.tuwharetoa.iwi.nz", 2/3 0/3 
     190X        "http://auturoa.nz", 0/4 0/3 [lots of MRI terms among English] - COMMUNITY 
     191        "https://www.terito.school.nz", 3/3, 0/2 total 
     192        "https://ttw1.cwp.govt.nz", 3/3 3/3 
     193        "https://www.whanau-tahi.school.nz", 4/4, 1/2 total 
     194        "https://e-ako-pangarau.nzmaths.co.nz", 3/3 total, 1/1 total 
     195        "https://teaomaori.news", 3/3, 0/1 total 
     196        "http://tetaurawhiri.govt.nz", 3/3 /3/3 [Māori Language Commission site] 
     197        "https://www.tuiatematangi.ac.nz", 4/4 3/3 
     198        "http://animations.tewhanake.maori.nz", 3/3 3/3 
     199!!       "https://www.dnc.org.nz", 1/1 total, 0 [no containsMRI outside isMRI pages] 
     200!!        "http://firstworldwar.tki.org.nz", 3/3, 0 [no containsMRI outside isMRI pages] 
     201        "http://www.28maoribattalion.org.nz", 3/3, 1/3 
     202        "http://www.tewikiotereomaori.co.nz", 1/1 total, 3/3 
     203        "http://www.brettgraham.co.nz", 1/1 total, 0/3 
     204!!        "https://hepatakakupu.nz", 3/3, 0 [no containsMRI outside isMRI pages] 
     205 
     206        "http://anglicanprayerbook.nz", 3/3 3/3 
     207        "http://arataua.nz", 4/4, 2/3 
     208        "http://blog.teara.govt.nz", 3/3, 0/3 [AS: teara.govt.nz] 
     209        "http://maori.tki.org.nz", 3/3 3/3 
     210DONE (with/out www):        "http://www.firstworldwar.tki.org.nz",  
     211X        "http://www.topomap.co.nz", 0/2 [all placenames], 0 [no containsMRI outside isMRI pages] 
     212        "https://paekupu.co.nz", 4/4, 0 [no containsMRI outside isMRI pages] 
     213        "https://haereheikaiako.co.nz", 1/1, 0 [no containsMRI outside isMRI pages] 
     214        "https://curriculumtool.education.govt.nz", 4/4, 3/3 
     215        "http://kurakokiri.maori.nz", 3/3, 3/3 [same nav menus on each page] 
     216        "http://kete.wcl.govt.nz", 2/5 [first 3 misdetected: Tokelauan (American Samoa), Kiribati, Tongan], 0/3 
     217        "http://www.kkmmaungarongo.co.nz", 3/3, 3/3 
     218        "http://www.heartland.co.nz", 3/3, 1/1 total 
     219        "http://oilcrash.com", 2/2 total, 0/3 
     220        "http://www.kura-porirua.school.nz", 4/4, 2/3 
     221        "http://videos.e-agent.nz", [AT: e-agent.nz] 3/3, 3/3 [repeated nav] 
     222        "https://www.sporty.co.nz", 3/3, 0 [no containsMRI outside isMRI pages] 
     223        "https://www.tematawai.maori.nz", 3/3, 3/3 
     224 
     225        "https://www.terakipaewhenua.school.nz",  
     226        "http://www.tetaurawhiri.govt.nz",  
     227        "http://archive.stats.govt.nz",  
     228        "http://tiritiowaitangi.govt.nz",  
     229        "http://www.waiata.maori.nz",  
     230        "http://hana.co.nz",  
     231        "http://kaupare.co.nz",  
     232        "http://www.tereowrap.nz",  
     233        "https://www.e-agent.nz",  
     234        "http://www.hrc.co.nz",  
     235        "http://ngatiporoukiponeke.org.nz",  
     236        "http://rurued.school.nz",  
     237        "http://www.twtop.school.nz",  
     238        "https://www.infinite-electronic.nz",  
     239        "http://www.huri-translations.pf",  
     240        "https://admin.teara.govt.nz",  
     241        "https://tiritiowaitangi.govt.nz",  
     242        "http://www.tmoa.tki.org.nz",  
     243        "https://www.komako.org.nz",  
     244        "http://www.wcl.govt.nz",  
     245        "https://office.e-agent.nz",  
     246        "http://punareo.co.nz",  
     247        "http://www.kurakokiri.maori.nz",  
     248        "https://rapuatearatika.education.govt.nz",  
     249        "http://tmmkkm.school.nz",  
     250        "https://www.components-mart.nz",  
     251        "http://www.cs.waikato.ac.nz",  
     252        "http://www.kupengahao.co.nz",  
     253        "https://www.hapuhauora.health.nz",  
     254        "https://www.lcds-display.nz",  
     255        "http://waiata.maori.nz",  
     256        "http://cms.sunsmartschools.co.nz",  
     257        "http://www.livingheritage.org.nz",  
     258        "http://kuraproductions.co.nz",  
     259        "https://keepourmoneyclean.govt.nz",  
     260        "http://www.tekura.school.nz",  
     261        "http://www.tkkmmokopuna.school.nz",  
     262        "http://hangaraumatihiko.tki.org.nz",  
     263        "http://www.pakanae.maori.nz" 
     264    ], 
     265    "numPagesInMRICount" : 4360, 
     266    "numPagesContainingMRICount" : 7968 
     267} 
     268 
     269---------------------------- 
     270 
     271/* 1 */ 
     272{ 
     273    "_id" : "nz", 
     274    "count" : 176.0, 
     275    "domain" : [  
     276!!        "http://pukekohe.directorybusiness.co.nz", 0/2, 0/2, isMRI = 0!! 
     277        "http://maori.livingheritage.org.nz", 2/2 2/2 
     278        "http://pukoro.co.nz", 2/2 0/2 
     279        "http://www.rakaumanga.school.nz", 0/4 0/4  
     280        "http://www.ngamanawainc.co.nz", 0/2 0/2 
     281        "https://office.e-agent.nz",  
     282        "https://www.components-mart.nz",  
     283        "http://tmmkkm.school.nz",  
     284        "http://www.rotoruanz.com",  
     285        "http://www.huri-translations.pf",  
     286        "https://admin.teara.govt.nz",  
     287        "http://hangaraumatihiko.tki.org.nz",  
     288        "https://sexualviolence.victimsinfo.govt.nz",  
     289        "http://www.tekura.school.nz",  
     290        "http://philipbeadle.co.nz",  
     291        "http://www.cs.waikato.ac.nz",  
     292        "https://www.hapuhauora.health.nz",  
     293        "http://cms.sunsmartschools.co.nz",  
     294        "https://keepourmoneyclean.govt.nz",  
     295        "http://www.kura-porirua.school.nz",  
     296        "http://waitarahistory.org.nz",  
     297        "http://oilcrash.com",  
     298        "http://videos.e-agent.nz",  
     299        "https://manawatuheritage.pncc.govt.nz",  
     300        "https://www.terakipaewhenua.school.nz",  
     301        "http://dev.nzpcn.org.nz",  
     302        "https://kotahimiriona.co.nz",  
     303        "http://kurakokiri.maori.nz",  
     304        "https://www.sporty.co.nz",  
     305        "http://kaupare.co.nz",  
     306        "http://ngatiporoukiponeke.org.nz",  
     307        "https://www.takitimu.ac.nz",  
     308        "http://www.tetaurawhiri.govt.nz",  
     309        "http://www.waiata.maori.nz",  
     310        "http://conference.tpwt.maori.nz",  
     311        "http://ngatiwhakaue.iwi.nz",  
     312        "http://www.nzpcn.org.nz",  
     313        "http://www.ruralfind.co.nz",  
     314        "https://www.dnc.org.nz",  
     315        "https://www.puau.school.nz",  
     316        "https://kaiiwicamp.nz",  
     317        "https://www.terito.school.nz",  
     318        "https://www.pinterest.nz",  
     319        "https://e-ako-pangarau.nzmaths.co.nz",  
     320        "http://givealittle.co.nz",  
     321        "https://teaomaori.news",  
     322        "https://www.korokikahukura.co.nz",  
     323        "http://myfathersworld.net.nz",  
     324        "http://www.firstworldwar.tki.org.nz",  
     325        "https://www.ashtangatauranga.co.nz",  
     326        "http://biketorqueyamaha.co.nz",  
     327        "https://www.rereahu.maori.nz",  
     328        "http://www.tewikiotereomaori.co.nz",  
     329        "http://www.brettgraham.co.nz",  
     330        "http://tewikiotereomaori.nz",  
     331        "http://anglicanprayerbook.nz",  
     332        "http://arataua.nz",  
     333        "http://blog.teara.govt.nz",  
     334        "http://www.otepoti.school.nz",  
     335        "http://www.kmk.maori.nz",  
     336        "http://www.eventcinemas.co.nz",  
     337        "https://www.stats.govt.nz",  
     338        "http://www.oag.govt.nz", 2/2 0/2 
     339        "http://whatonga.school.nz",  
     340        "http://www.tewhanake.maori.nz",  
     341        "https://www.maoritelevision.com",  
     342        "http://kuraaiwi.maori.nz",  
     343        "http://kurataiao.tki.org.nz",  
     344        "http://teaohou.natlib.govt.nz",  
     345        "http://www.tetaumuturunanga.iwi.nz",  
     346        "http://www.tasteofplenty.co.nz",  
     347        "http://community.nzdl.org",  
     348        "https://www.blushandbrows.nz",  
     349        "https://register.tpota.org.nz",  
     350        "https://cdn.tehiku.nz",  
     351        "http://www.wcl.govt.nz",  
     352        "http://www.jeremybaker.nz",  
     353        "http://punareo.co.nz",  
     354        "https://rapuatearatika.education.govt.nz",  
     355        "http://www.kurakokiri.maori.nz",  
     356        "https://www.cruisetourstauranga.co.nz",  
     357        "https://sooty.nz",  
     358        "http://rakaumanga.school.nz",  
     359        "https://tiritiowaitangi.govt.nz",  
     360        "http://www.tmoa.tki.org.nz",  
     361        "http://www.w3vietnam.org.nz",  
     362        "https://www.infinite-electronic.nz",  
     363        "https://www.komako.org.nz",  
     364        "http://nzpostcard.co.nz",  
     365        "http://artizani.co.nz",  
     366        "http://www.finlaysonpark.school.nz",  
     367        "http://crimson.co.nz",  
     368        "http://holyspirit.nz",  
     369        "http://www.tkkmmokopuna.school.nz",  
     370        "http://www.pakanae.maori.nz",  
     371        "http://www.teipukarea.maori.nz",  
     372        "http://archerpix.com",  
     373        "https://2019.nethui.nz",  
     374        "http://www.kupengahao.co.nz",  
     375        "https://www.lcds-display.nz",  
     376        "http://waiata.maori.nz",  
     377        "http://kuraproductions.co.nz",  
     378        "http://www.biketorqueyamaha.co.nz",  
     379        "http://www.livingheritage.org.nz",  
     380        "http://www.zoomin.co.nz",  
     381        "http://rsnz.natlib.govt.nz",  
     382        "http://otorohanga.directorybusiness.co.nz",  
     383        "http://reoora.co.nz",  
     384        "http://w3vietnam.org.nz",  
     385        "https://rehuamarae.co.nz",  
     386        "https://www.electionresults.org.nz",  
     387        "https://www.ngamanawainc.co.nz",  
     388        "https://www.rotorua-rafting.co.nz",  
     389        "https://www.taitokerautrust.org.nz",  
     390        "https://www.wingspan.co.nz",  
     391        "http://www.kkmmaungarongo.co.nz",  
     392        "http://kete.wcl.govt.nz",  
     393        "http://www.heartland.co.nz",  
     394        "http://www.electionresults.govt.nz",  
     395        "https://www.tematawai.maori.nz",  
     396        "http://hana.co.nz",  
     397        "http://www.tereowrap.nz",  
     398        "http://rurued.school.nz",  
     399        "http://www.twtop.school.nz",  
     400        "http://rexedra.gen.nz",  
     401        "http://archive.stats.govt.nz",  
     402        "https://liveresults.co.nz",  
     403        "https://www.e-agent.nz",  
     404        "http://tiritiowaitangi.govt.nz",  
     405        "http://www.hrc.co.nz",  
     406        "http://animations.tewhanake.maori.nz",  
     407        "https://interactives.stuff.co.nz",  
     408        "http://avonside.net",  
     409        "http://www.methodist.org.nz",  
     410        "https://www.tasteofplenty.co.nz",  
     411        "http://www.maoriinvestments.co.nz",  
     412        "https://m.wairarapatv.co.nz",  
     413        "http://www.gans.co.nz",  
     414        "https://ttw1.cwp.govt.nz",  
     415        "http://ngarauhuia.ngatiapakiterato.iwi.nz",  
     416        "https://www.tuiatematangi.ac.nz",  
     417        "http://tetaurawhiri.govt.nz",  
     418        "http://maori.tki.org.nz",  
     419        "http://www.topomap.co.nz",  
     420        "https://www.puhaandpakeha.co.nz",  
     421        "https://haereheikaiako.co.nz",  
     422        "https://paekupu.co.nz",  
     423        "https://curriculumtool.education.govt.nz",  
     424        "http://firstworldwar.tki.org.nz",  
     425        "http://www.28maoribattalion.org.nz",  
     426        "https://hepatakakupu.nz",  
     427        "https://www.zenbu.co.nz",  
     428        "http://www.matarikifestival.org.nz",  
     429        "http://pukapuka.nz",  
     430        "http://ngatipahauwera.co.nz", 2/2 2/2 
     431        "http://southerntribes.co.nz",  
     432        "https://player.vimeo.com",  
     433        "http://tmoa.tki.org.nz",  
     434        "http://www.writersfestival.co.nz",  
     435        "http://talkingtothecan.com",  
     436        "https://www.whanau-tahi.school.nz",  
     437        "http://satellites.co.nz",  
     438        "http://auturoa.nz",  
     439        "http://www.tuwharetoa.iwi.nz",  
     440        "http://kmpmusic.co.nz",  
     441        "http://www.temarareo.org",  
     442        "http://archive.electionresults.govt.nz",  
     443        "http://kaiiwicamp.nz",  
     444        "http://tehauora.org.nz",  
     445        "http://temahurehure.maori.nz",  
     446        "http://www.runanga.co.nz" 
     447    ], 
     448    "numPagesInMRICount" : 4360, 
     449    "numPagesContainingMRICount" : 9641 
     450} 
     451 
     452