Changeset 33890

Timestamp:
03.02.2020 20:31:33 (2 weeks ago)
Author:
ak19
Message:

Finished going through the listing of NZ sites with numPagesContainingMRI > 0 and manually determining which of these sites really contain at least one webpage with at least one sentence in MRI.

Files:
1 modified

  • other-projects/maori-lang-detection/mongodb-data/6table_nonProductSites1_manualShortlist.json

    r33884 r33890  
    199199        "http://pukapuka.nz", 3/3 1/4 [lorem ipsum used on first 3 pages] 
    200200        "http://community.nzdl.org", 3/3 0/3 [containsMRI has detected Te Taka Keegan as MRI sentence] 
    201 !!        "http://kmpmusic.co.nz", 0-4/4? [but CD listing of some MRI song titles] 0 [no other pages containsMRI] 
    202         "http://maori.livingheritage.org.nz", 2/2 2/2 
     201X!!        "http://kmpmusic.co.nz", 0-4/4? [but CD listing of some MRI album and song titles] 0 [no other pages containsMRI] 
     202        "http://maori.livingheritage.org.nz", 2/2 2/2 {includes: http://www.livingheritage.org.nz} 
    203203        "http://pukoro.co.nz", 2/2 0/2 
    204         "https://register.tpota.org.nz", 0/1 [form] 0/2 
    205 X        "https://cdn.tehiku.nz" => DOMAIN: "tehiku.nz",  0/4, 1/3 [but audio content may be in MRI] 
     204X        "https://register.tpota.org.nz", 0/1 [form] 0/2 
     205+        "https://cdn.tehiku.nz" => DOMAIN: "tehiku.nz",  0/4, 1/3 [but audio content may be in MRI] But there are pages containing MRI to be found by non-random sampling of tehiku.nz, e.g. https://tehiku.nz/te-hiku-radio/te-tangihanga-ki-a-erima-henare/ contains MRI sentences 
    206206!!        "http://www.runanga.co.nz", 3/3 0 [no containsMRI outside isMRI pages] 
    207207!        "http://kuraaiwi.maori.nz", 2/4 [navigation only downloaded. But site content checked] 2/3 
     
    211211        "http://teaohou.natlib.govt.nz", 4/4, 2/4 
    212212        "http://www.tuwharetoa.iwi.nz", 2/3 0/3 
    213 X        "http://auturoa.nz", 0/4 0/3 [lots of MRI terms among English] - COMMUNITY 
     213+        "http://auturoa.nz", 0/4 0/3 [lots of MRI terms among English] - COMMUNITY (But there are pages inMRI to be found by non-random sampling, e.g. http://auturoa.nz/KarakiaMoKuaToRangiTeRaa.html) 
    214214        "https://www.terito.school.nz", 3/3, 0/2 total 
    215215        "https://ttw1.cwp.govt.nz", 3/3 3/3 
     
    228228 
    229229        "http://anglicanprayerbook.nz", 3/3 3/3 
    230         "http://arataua.nz", 4/4, 2/3 
    231         "http://blog.teara.govt.nz", 3/3, 0/3 [AS: teara.govt.nz] 
     230        "http://arataua.nz", 4/4, 2/3         
    232231        "http://maori.tki.org.nz", 3/3 3/3 
    233232DONE (with/out www):        "http://www.firstworldwar.tki.org.nz",  
     
    236235        "https://haereheikaiako.co.nz", 1/1, 0 [no containsMRI outside isMRI pages] 
    237236        "https://curriculumtool.education.govt.nz", 4/4, 3/3 
    238         "http://kurakokiri.maori.nz", 3/3, 3/3 [same nav menus on each page] 
    239         "http://kete.wcl.govt.nz", 2/5 [first 3 misdetected: Tokelauan (American Samoa), Kiribati, Tongan], 0/3 
     237        "http://kurakokiri.maori.nz", 3/3, 3/3 [same nav menus on each page] {includes: "http://www.kurakokiri.maori.nz"}    
    240238        "http://www.kkmmaungarongo.co.nz", 3/3, 3/3 
    241239        "http://www.heartland.co.nz", 3/3, 1/1 total 
    242240        "http://oilcrash.com", 2/2 total, 0/3 
    243         "http://www.kura-porirua.school.nz", 4/4, 2/3 
    244         "http://videos.e-agent.nz", [AT: e-agent.nz] 3/3, 3/3 [repeated nav] 
     241        "http://www.kura-porirua.school.nz", 4/4, 2/3        
    245242        "https://www.sporty.co.nz", 3/3, 0 [no containsMRI outside isMRI pages] 
    246243        "https://www.tematawai.maori.nz", 3/3, 3/3 
    247244 
    248         "https://www.terakipaewhenua.school.nz",  
    249         "http://www.tetaurawhiri.govt.nz",  
    250         "http://archive.stats.govt.nz",  
    251         "http://tiritiowaitangi.govt.nz",  
    252         "http://www.waiata.maori.nz",  
    253         "http://hana.co.nz",  
    254         "http://kaupare.co.nz",  
    255         "http://www.tereowrap.nz",  
    256         "https://www.e-agent.nz",  
    257         "http://www.hrc.co.nz",  
    258         "http://ngatiporoukiponeke.org.nz",  
    259         "http://rurued.school.nz",  
    260         "http://www.twtop.school.nz",  
    261         "https://www.infinite-electronic.nz",  
    262         "http://www.huri-translations.pf",  
    263         "https://admin.teara.govt.nz",  
    264         "https://tiritiowaitangi.govt.nz",  
    265         "http://www.tmoa.tki.org.nz",  
    266         "https://www.komako.org.nz",  
    267         "http://www.wcl.govt.nz",  
    268         "https://office.e-agent.nz",  
    269         "http://punareo.co.nz",  
    270         "http://www.kurakokiri.maori.nz",  
    271         "https://rapuatearatika.education.govt.nz",  
    272         "http://tmmkkm.school.nz",  
    273         "https://www.components-mart.nz",  
    274         "http://www.cs.waikato.ac.nz",  
    275         "http://www.kupengahao.co.nz",  
    276         "https://www.hapuhauora.health.nz",  
    277         "https://www.lcds-display.nz",  
    278         "http://waiata.maori.nz",  
    279         "http://cms.sunsmartschools.co.nz",  
    280         "http://www.livingheritage.org.nz",  
    281         "http://kuraproductions.co.nz",  
    282         "https://keepourmoneyclean.govt.nz",  
    283         "http://www.tekura.school.nz",  
    284         "http://www.tkkmmokopuna.school.nz",  
    285         "http://hangaraumatihiko.tki.org.nz",  
    286         "http://www.pakanae.maori.nz" 
     245+        "https://www.terakipaewhenua.school.nz",  
     246+        "http://www.tetaurawhiri.govt.nz",  
     247+        "http://archive.stats.govt.nz", (1 page isMRI) 
     248+        "http://tiritiowaitangi.govt.nz",  
     249+!!      "http://www.waiata.maori.nz", {includes: "http://waiata.maori.nz"} 
     250+        "http://hana.co.nz", [crawled version of page contains MRI sentences, but current page is labelled for mobiles whereas in browser it just has a picture] 
     251+        "http://kaupare.co.nz",  
     252+        "http://www.tereowrap.nz",  
     253?X        "https://www.e-agent.nz", [autotranslated? SEO related site in NZ, Chinese, English and MRI] {includes: "https://office.e-agent.nz"} 
     254                 { included: "http://videos.e-agent.nz", [AT: e-agent.nz] 3/3, 3/3 [repeated nav] }              
     255+        "http://www.hrc.co.nz",  
     256+        "http://ngatiporoukiponeke.org.nz",  
     257 
     258+        "http://rurued.school.nz",  
     259+        "http://www.twtop.school.nz",  
     260X        "https://www.infinite-electronic.nz", [autotranslated product site] 
     261+!!      "http://www.huri-translations.pf",  
     262+        "https://admin.teara.govt.nz", {included: "http://blog.teara.govt.nz", 3/3, 0/3 [AS: teara.govt.nz]} 
     263+!!        "https://tiritiowaitangi.govt.nz",  
     264+        "http://www.tmoa.tki.org.nz",  
     265+        "https://www.komako.org.nz", [no longer available, but crawled page had a paragraph in Maori with presumably a translation thereafter] 
     266+        "http://www.wcl.govt.nz", {included: "http://kete.wcl.govt.nz" as wcl.govt.nz; 2/5 [first 3 misdetected: Tokelauan (American Samoa), Kiribati, Tongan], 0/3}         
     267+!!      "http://punareo.co.nz", [waiata]         
     268 
     269+        "https://rapuatearatika.education.govt.nz",  
     270+        "http://tmmkkm.school.nz",  
     271X        "https://www.components-mart.nz",  [autotranslated product site] 
     272+        "http://www.cs.waikato.ac.nz", [Te Taka's pages!] 
     273+!!!        "http://www.kupengahao.co.nz", [MRI language books and resources] 
     274+        "https://www.hapuhauora.health.nz", [Smokefree site. The one isMRI page has a proper sentence.] 
     275X        "https://www.lcds-display.nz",  [autotranslated product site]         
     276+        "http://cms.sunsmartschools.co.nz", [as sunsmartschools.co.nz, e.g. http://sunsmartschools.co.nz/kura/how-to-become.html]     
     277+        "http://kuraproductions.co.nz",  
     278+        "https://keepourmoneyclean.govt.nz", [1 page] 
     279 
     280+!!      "http://www.tekura.school.nz",  
     281+        "http://www.tkkmmokopuna.school.nz", [e.g. http://www.tkkmmokopuna.school.nz/newsletter_sets/newsletters/13-wahanga-4-wiki-5-he-puna-korero] 
     282+        "http://hangaraumatihiko.tki.org.nz", [govt website. e.g. http://hangaraumatihiko.tki.org.nz/te-tupuranga-tangata-me-te-rorohiko/whakatupuranga-3-te-whakatau-tikanga-kawe-i-nga-mahi-whakaoti-hopanga/] 
     283+        "http://www.pakanae.maori.nz" 
    287284    ], 
    288285    "numPagesInMRICount" : 4360, 
     
    290287} 
    291288 
     289 
     290 96 sites detected as having isMRI pages - 7 sites/subdomains already included in the existing 96 = 89 sites. 
     291 
     292 -2.5* product sites -2 non-MRI sites with songlistings or forms etc 
     293     *0.5 for e-agent.nz site 
     294 = 84.5 sites total that at least contain MRI, most have pages inMRI. 
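The tally in the note above can be rechecked with a few lines of arithmetic. This is a minimal sketch in JavaScript, chosen because the query in these notes is written for the mongo shell. The variable names are illustrative and do not appear in the original file. The figures are taken directly from the note.

```javascript
// Recheck of the site tally; all variable names are illustrative.
const detectedSites = 96;    // sites detected as having isMRI pages
const alreadyIncluded = 7;   // sites/subdomains already folded into other entries
const productSites = 2.5;    // product sites (the note counts e-agent.nz as 0.5)
const nonMriSites = 2;       // non-MRI sites with song listings, forms etc.

const distinctSites = detectedSites - alreadyIncluded;
const sitesContainingMri = distinctSites - productSites - nonMriSites;

console.log(distinctSites);       // 89
console.log(sitesContainingMri);  // 84.5
```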
    292295---------------------------- 
    293296 
     
    474477 
    475478 
     479---------------------------- 
     480 
     481 The remainder: 80 NZ sites detected as not containing pages inMRI, but with a positive number of pages detected as containsMRI: 
     482 
     483db.Websites.aggregate([ 
     484    { 
     485        $match: { 
     486            $and: [ 
     487                {numPagesContainingMRI: {$gt: 0}}, 
     488                {numPagesInMRI: {$eq: 0}}, 
     489                {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]} 
     490            ] 
     491        } 
     492    }, 
     493    { $unwind: "$geoLocationCountryCode" }, 
     494    { 
     495        $group: { 
     496            _id: "nz", 
     497            count: { $sum: 1 }, 
     498            domain: { $addToSet: '$domain' }, 
     499            numPagesInMRICount: { $sum: '$numPagesInMRI' }, 
     500            numPagesContainingMRICount: { $sum: '$numPagesContainingMRI' } 
     501        } 
     502    }, 
     503    { $sort : { count : -1} } 
     504]); 
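The `$match` stage above can be sketched as a plain JavaScript predicate, which may help when checking why a given site ends up in the result. This is only an in-memory approximation, not the query itself: the sample documents and both function names are made up for illustration. One thing it makes visible is that the unanchored pattern `/\.nz/` also matches hosts that merely contain ".nz" somewhere. The second function is a hypothetical stricter check, not something the original notes use.

```javascript
// In-memory approximation of the $match stage; function names are hypothetical.
function matchesNzContainsOnly(site) {
  const isNz = site.geoLocationCountryCode === "NZ" || /\.nz/.test(site.domain);
  return site.numPagesContainingMRI > 0 && site.numPagesInMRI === 0 && isNz;
}

// Hypothetical stricter test: ".nz" must end the host (an optional port is allowed).
function hasNzTld(domain) {
  return /\.nz(:\d+)?$/.test(domain.replace(/\/+$/, ""));
}

// Made-up sample documents using the field names from the query.
const sites = [
  { domain: "http://example.co.nz", geoLocationCountryCode: "NZ", numPagesContainingMRI: 3, numPagesInMRI: 0 },
  { domain: "http://example.nzsite.com", geoLocationCountryCode: "US", numPagesContainingMRI: 1, numPagesInMRI: 0 },
  { domain: "http://example.org.nz", geoLocationCountryCode: "AU", numPagesContainingMRI: 2, numPagesInMRI: 5 },
];

const matched = sites.filter(matchesNzContainsOnly).map((s) => s.domain);
console.log(matched);
console.log(sites.map((s) => hasNzTld(s.domain)));
```

Sites geolocated to NZ but hosted under other top-level domains are still picked up through the country-code branch, so a stricter domain test would only affect the regex branch.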
     505 
     506 
     507 Find pages for testing with: 
     508    db.getCollection('Webpages').find({URL:/ashtangatauranga\.co\.nz/, containsMRI: true, mriSentenceCount: {$gt: 0}}) 
     509 
     510 
     511/* 1 */ 
     512{ 
     513    "_id" : "nz", 
     514    "count" : 80.0, 
     515    "domain" : [  
     516X        "http://www.zoomin.co.nz", [map site, so placenames] 
     517X        "http://www.biketorqueyamaha.co.nz", [placenames] {includes "http://biketorqueyamaha.co.nz"} 
     518X        "http://archerpix.com", [photo captions containing placenames] 
     519X        "http://philipbeadle.co.nz", [art captions containing placenames] 
     520X        "https://2019.nethui.nz", [Just MRI words in ENG sentences] 
     521X        "http://crimson.co.nz", [address] 
     522+        "http://holyspirit.nz", (e.g. https://holyspirit.nz/wp-content/uploads/02-25-2018-Newsletter.pdf) 
     523X        "https://www.wingspan.co.nz", [1 page, 1 conjoined word, looks like placename] 
     524X        "http://nzpostcard.co.nz", [postcards with placenames] 
     525+        "https://www.ngamanawainc.co.nz", (e.g. https://www.ngamanawainc.co.nz/history/id/60) {includes "http://www.ngamanawainc.co.nz"} 
     526 
     527+        "http://www.finlaysonpark.school.nz", [e.g. http://www.finlaysonpark.school.nz/58/pages/61-team-overview; Some Tongan language pages and Samoan pages] 
     528X        "http://artizani.co.nz", [address] 
     529+        "http://www.w3vietnam.org.nz", [e.g. http://www.w3vietnam.org.nz/w3%20mihi%20ki%20ka%20hoia.htm] (includes "http://w3vietnam.org.nz") 
     530X        "https://sooty.nz", [names, war death notices, place names] 
     531X?        "http://rakaumanga.school.nz", [School has a longer Maori title name, but that's it] {includes "http://www.rakaumanga.school.nz"} 
     532X        "http://www.rotoruanz.com", [e.g. https://www.rotoruanz.com/RNZ/media/Media-Library/SOI-Board-version-08-June-2018-final-published-on-website.pdf] 
     533X        "https://www.cruisetourstauranga.co.nz", [English with Tauranga being MRI placename] 
     534X        "http://www.jeremybaker.nz", [one word, HOkio] 
     535 
     536X        "https://liveresults.co.nz", [canoe sports team names] 
     537X        "http://rexedra.gen.nz", [ENG sentence with MRI words] 
     538+        "https://www.takitimu.ac.nz", [school of performing arts, example MRI sentence at https://www.takitimu.ac.nz/about-us] 
     539X        "http://www.electionresults.govt.nz", [placenames and misdetection of Return to Home Page] {includes https://www.electionresults.org.nz which seems to be an alias for the same} {includes "http://archive.electionresults.govt.nz"} 
     540+        "https://kotahimiriona.co.nz", e.g. (https://kotahimiriona.co.nz/about-the-app/) 
     541+        "https://rehuamarae.co.nz", (e.g. https://rehuamarae.co.nz/) 
     542+        "http://reoora.co.nz", (e.g. https://reoora.co.nz/about/) 
     543 
     544X        "http://otorohanga.directorybusiness.co.nz", [placenames] 
     545X        "http://waitarahistory.org.nz", [placenames and misdetection of sentences of the form "31, No 262." as MRI] 
     546+        "https://manawatuheritage.pncc.govt.nz", (e.g. https://manawatuheritage.pncc.govt.nz/about) 
     547+        "http://rsnz.natlib.govt.nz", [e.g. http://rsnz.natlib.govt.nz/volume/rsnz_26/rsnz_26_00_004560.html] NOTE: site appears to use Greenstone 
     548X        "https://www.rotorua-rafting.co.nz", [placenames] 
     549+        "https://www.taitokerautrust.org.nz", (e.g. https://www.taitokerautrust.org.nz/) 
     550+        "http://tewikiotereomaori.nz", (e.g. https://tewikiotereomaori.nz/about/) 
     551+        "https://www.korokikahukura.co.nz", (e.g. https://www.korokikahukura.co.nz/ko-wai-m257tou.html about the connection to the Waikato River) 
     552 
     553X        "https://www.puhaandpakeha.co.nz", [ENG sentences with MRI words] 
     554X        "http://myfathersworld.net.nz", [placenames] 
     555X        "https://www.ashtangatauranga.co.nz", [misdetection] 
     556+        "https://www.pinterest.nz", (e.g. sentence "E tu Whare Rangi, e tu Whare Pataka, kia ora koutou nga Whare Tipuna o te Ao Maori." at https://www.pinterest.nz/amp/marylizw03/maori/) 
     557+        "https://www.rereahu.maori.nz", (e.g. https://www.rereahu.maori.nz/uploads/72717/files/168860/Te_Maru_o_Rereahu_Newsletter_No_5_February_2011.pdf) 
     558+        "http://givealittle.co.nz", (e.g. https://givealittle.co.nz/cause/rebuild-tapu-te-ranga-marae which contains greetings phrases and sentence "Nā te ringa tangata i hanga te whare Nā te tuarā o te whare i whakatipu i te tangata") 
     559X        "http://www.gans.co.nz", [placenames] 
     560+        "https://kaiiwicamp.nz", [placenames] {includes "http://kaiiwicamp.nz"} 
     561+        "http://ngarauhuia.ngatiapakiterato.iwi.nz", (e.g. http://ngarauhuia.ngatiapakiterato.iwi.nz/pdf/Nga-Rau-Huia_Raumati_2015_Issue1.pdf) 
     562+        "https://m.wairarapatv.co.nz", (e.g. https://m.wairarapatv.co.nz/archive/i/JGmJqCZCNK8/te-ara-whanui-kura-kaupapa-maori-o-nga-khanga-reo-o-te-awa-kairangi) 
     563 
     564X        "http://www.methodist.org.nz", [ENG sentence with MRI words] 
     565+        "http://avonside.net", (e.g. http://avonside.net/Nicky/indexmain.htm) 
     566X        "http://www.ruralfind.co.nz", [placenames] 
     567+        "http://www.maoriinvestments.co.nz", (e.g. Vision, Mission & Values tab of https://maoriinvestments.co.nz/about/organisation) 
     568+        "http://conference.tpwt.maori.nz", (e.g. https://www.tpwt.maori.nz/te-mahe-matauranga/) 
     569+        "https://www.puau.school.nz", (e.g. https://www.puau.school.nz/k%C4%81inga-home) 
     570+?        "http://ngatiwhakaue.iwi.nz", (e.g. greetings at http://ngatiwhakaue.iwi.nz/) 
     571X        "http://www.nzpcn.org.nz", [lots of misdetections and one sentence in English with MRI words, e.g. "Kaori is a Taonga to Maori".] {includes "http://dev.nzpcn.org.nz"} 
     572+?        "https://interactives.stuff.co.nz", [2 x page mostly empty except for the title "TE WIKI O TE REO MĀORI Māori"] 
     573+        "http://tehauora.org.nz", (e.g. http://tehauora.org.nz/ and proverb at bottom of http://tehauora.org.nz/about-us) 
     574 
     575+        "http://temahurehure.maori.nz", (e.g. greeting http://temahurehure.maori.nz/site/file/temp/hirebook.pdf) 
     576X        "http://pukekohe.directorybusiness.co.nz", [placenames] 
     577+!!      "http://www.temarareo.org", (dictionary with sample sentences. e.g. http://www.temarareo.org/PAPAKUPU/dictionary-searchresults/A-2.htm) 
     578X        "https://www.tasteofplenty.co.nz", [Tauranga placename and misdetection of "No one went away hungry"] {includes "http://www.tasteofplenty.co.nz"} 
     579       
     580+        "http://www.tetaumuturunanga.iwi.nz", (e.g. http://www.tetaumuturunanga.iwi.nz/wp-content/uploads/2016/04/Malvern-Cultural-Narrative-Draft.pdf) 
     581         
     582         
     583X        "https://www.blushandbrows.nz", [misdetection of "Makeup..."] 
     584X        "http://talkingtothecan.com", [misdetection of things like "22, no." and mistaking ENG sentences with MRI words] 
     585 
     586+?       "http://whatonga.school.nz", [school title] 
     587+?       "https://player.vimeo.com", [Video titles contain MRI sentences and MRI titles interspersed in ENG sentence. Video's audio content may be in MRI] 
     588+        "http://www.writersfestival.co.nz", (e.g. http://www.writersfestival.co.nz/news/Page1/urgently-relevant-novel-wins-countrys-richest-literary-award/) 
     589+?       "http://southerntribes.co.nz", [brazilian jiu jitsu site with a greeting in MRI on main page] 
     590+        "http://www.kmk.maori.nz", (e.g. http://www.kmk.maori.nz/kmk-events) 
     591+        "https://www.stats.govt.nz", (e.g. http://archive.stats.govt.nz/Census/2001-census-data/2001-census-pacific-profiles/cook-island-maori-people-in-new-zealand.aspx) 
     592X        "http://www.eventcinemas.co.nz", [placenames and misdetection of ENG "Atura Hotels"] 
     593X        "https://www.zenbu.co.nz" [misdetection and NZ school addresses] 
     594    ], 
     595    "numPagesInMRICount" : 0, 
     596    "numPagesContainingMRICount" : 1673 
     597}
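The manual markers in the annotated listing above could be tallied with a small script rather than by hand. The sketch below is a guess at how that might look, under stated assumptions: the file does not define its markers, so reading a leading "+" as "site kept" and a leading "X" as "site rejected" is an interpretation, and the function name is hypothetical. It expects lines as they appear in the underlying file, without the diff's line-number prefixes.

```javascript
// Tally review markers on annotated site lines.
// Assumption: leading "+" = site kept, leading "X" = site rejected;
// "?" and "!" are treated as extra flags and otherwise ignored.
function tallyMarkers(lines) {
  const tally = { kept: 0, rejected: 0, unmarked: 0 };
  for (const line of lines) {
    const m = line.match(/^\s*([+X?!]*)\s*"https?:\/\/[^"]+"/);
    if (!m) continue; // not a site entry (blank line, bracket, etc.)
    const marker = m[1];
    if (marker.startsWith("+")) tally.kept += 1;
    else if (marker.startsWith("X")) tally.rejected += 1;
    else tally.unmarked += 1; // no marker, or one starting with "?" or "!"
  }
  return tally;
}

// A few entries in the style of the listing, with annotations shortened.
const sample = [
  'X        "http://www.zoomin.co.nz", [map site, so placenames]',
  '+        "http://holyspirit.nz",',
  'X?        "http://rakaumanga.school.nz",',
  '+!!      "http://www.temarareo.org",',
  '',
  '],',
];

const result = tallyMarkers(sample);
console.log(result);
```

Entries whose marker starts with "?" (such as "?X") fall into `unmarked` in this sketch, so they would still need a manual decision.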