Changeset 33890


Timestamp:
2020-02-03T20:31:33+13:00 (4 years ago)
Author:
ak19
Message:

Finished going through the listing of NZ sites with numPagesContainingMRI > 0 and manually determining which of these sites really contain at least one webpage with at least one sentence in MRI.

File:
1 edited

  • other-projects/maori-lang-detection/mongodb-data/6table_nonProductSites1_manualShortlist.json

    r33884 r33890  
    199199        "http://pukapuka.nz", 3/3 1/4 [lorem ipsum used on first 3 pages]
    200200        "http://community.nzdl.org", 3/3 0/3 [containsMRI has detected Te Taka Keegan as MRI sentence]
    201 !!        "http://kmpmusic.co.nz", 0-4/4? [but CD listing of some MRI song titles] 0 [no other pages containsMRI]
    202         "http://maori.livingheritage.org.nz", 2/2 2/2
     201X!!        "http://kmpmusic.co.nz", 0-4/4? [but CD listing of some MRI album and song titles] 0 [no other pages containsMRI]
     202        "http://maori.livingheritage.org.nz", 2/2 2/2 {includes: http://www.livingheritage.org.nz}
    203203        "http://pukoro.co.nz", 2/2 0/2
    204         "https://register.tpota.org.nz", 0/1 [form] 0/2
    205 X        "https://cdn.tehiku.nz" => DOMAIN: "tehiku.nz",  0/4, 1/3 [but audio content may be in MRI]
     204X        "https://register.tpota.org.nz", 0/1 [form] 0/2
     205+        "https://cdn.tehiku.nz" => DOMAIN: "tehiku.nz",  0/4, 1/3 [but audio content may be in MRI] But there are pages containing MRI to be found by non-random sampling of tehiku.nz, e.g. https://tehiku.nz/te-hiku-radio/te-tangihanga-ki-a-erima-henare/ contains MRI sentences
    206206!!        "http://www.runanga.co.nz", 3/3 0 [no containsMRI outside isMRI pages]
    207207!        "http://kuraaiwi.maori.nz", 2/4 [navigation only downloaded. But site content checked] 2/3
     
    211211        "http://teaohou.natlib.govt.nz", 4/4, 2/4
    212212        "http://www.tuwharetoa.iwi.nz", 2/3 0/3
    213 X        "http://auturoa.nz", 0/4 0/3 [lots of MRI terms among English] - COMMUNITY
     213+        "http://auturoa.nz", 0/4 0/3 [lots of MRI terms among English] - COMMUNITY (But there are pages inMRI to be found by non-random sampling, e.g. http://auturoa.nz/KarakiaMoKuaToRangiTeRaa.html)
    214214        "https://www.terito.school.nz", 3/3, 0/2 total
    215215        "https://ttw1.cwp.govt.nz", 3/3 3/3
     
    228228
    229229        "http://anglicanprayerbook.nz", 3/3 3/3
    230         "http://arataua.nz", 4/4, 2/3
    231         "http://blog.teara.govt.nz", 3/3, 0/3 [AS: teara.govt.nz]
     230        "http://arataua.nz", 4/4, 2/3       
    232231        "http://maori.tki.org.nz", 3/3 3/3
    233232DONE (with/out www):        "http://www.firstworldwar.tki.org.nz",
     
    236235        "https://haereheikaiako.co.nz", 1/1, 0 [no containsMRI outside isMRI pages]
    237236        "https://curriculumtool.education.govt.nz", 4/4, 3/3
    238         "http://kurakokiri.maori.nz", 3/3, 3/3 [same nav menus on each page]
    239         "http://kete.wcl.govt.nz", 2/5 [first 3 misdetected: Tokelauan (American Samoa), Kiribati, Tongan], 0/3
     237        "http://kurakokiri.maori.nz", 3/3, 3/3 [same nav menus on each page] {includes: "http://www.kurakokiri.maori.nz"}   
    240238        "http://www.kkmmaungarongo.co.nz", 3/3, 3/3
    241239        "http://www.heartland.co.nz", 3/3, 1/1 total
    242240        "http://oilcrash.com", 2/2 total, 0/3
    243         "http://www.kura-porirua.school.nz", 4/4, 2/3
    244         "http://videos.e-agent.nz", [AT: e-agent.nz] 3/3, 3/3 [repeated nav]
     241        "http://www.kura-porirua.school.nz", 4/4, 2/3       
    245242        "https://www.sporty.co.nz", 3/3, 0 [no containsMRI outside isMRI pages]
    246243        "https://www.tematawai.maori.nz", 3/3, 3/3
    247244
    248         "https://www.terakipaewhenua.school.nz",
    249         "http://www.tetaurawhiri.govt.nz",
    250         "http://archive.stats.govt.nz",
    251         "http://tiritiowaitangi.govt.nz",
    252         "http://www.waiata.maori.nz",
    253         "http://hana.co.nz",
    254         "http://kaupare.co.nz",
    255         "http://www.tereowrap.nz",
    256         "https://www.e-agent.nz",
    257         "http://www.hrc.co.nz",
    258         "http://ngatiporoukiponeke.org.nz",
    259         "http://rurued.school.nz",
    260         "http://www.twtop.school.nz",
    261         "https://www.infinite-electronic.nz",
    262         "http://www.huri-translations.pf",
    263         "https://admin.teara.govt.nz",
    264         "https://tiritiowaitangi.govt.nz",
    265         "http://www.tmoa.tki.org.nz",
    266         "https://www.komako.org.nz",
    267         "http://www.wcl.govt.nz",
    268         "https://office.e-agent.nz",
    269         "http://punareo.co.nz",
    270         "http://www.kurakokiri.maori.nz",
    271         "https://rapuatearatika.education.govt.nz",
    272         "http://tmmkkm.school.nz",
    273         "https://www.components-mart.nz",
    274         "http://www.cs.waikato.ac.nz",
    275         "http://www.kupengahao.co.nz",
    276         "https://www.hapuhauora.health.nz",
    277         "https://www.lcds-display.nz",
    278         "http://waiata.maori.nz",
    279         "http://cms.sunsmartschools.co.nz",
    280         "http://www.livingheritage.org.nz",
    281         "http://kuraproductions.co.nz",
    282         "https://keepourmoneyclean.govt.nz",
    283         "http://www.tekura.school.nz",
    284         "http://www.tkkmmokopuna.school.nz",
    285         "http://hangaraumatihiko.tki.org.nz",
    286         "http://www.pakanae.maori.nz"
     245+        "https://www.terakipaewhenua.school.nz",
     246+        "http://www.tetaurawhiri.govt.nz",
     247+        "http://archive.stats.govt.nz", (1 page isMRI)
     248+        "http://tiritiowaitangi.govt.nz",
     249+!!      "http://www.waiata.maori.nz", {includes: "http://waiata.maori.nz"}
     250+        "http://hana.co.nz", [crawled version of page contains MRI sentences, but current page is labelled for mobiles whereas in browser it just has a picture]
     251+        "http://kaupare.co.nz",
     252+        "http://www.tereowrap.nz",
     253?X        "https://www.e-agent.nz", [autotranslated? SEO related site in NZ, Chinese, English and MRI] {includes: "https://office.e-agent.nz"}
     254                 { included: "http://videos.e-agent.nz", [AT: e-agent.nz] 3/3, 3/3 [repeated nav] }             
     255+        "http://www.hrc.co.nz",
     256+        "http://ngatiporoukiponeke.org.nz",
     257
     258+        "http://rurued.school.nz",
     259+        "http://www.twtop.school.nz",
     260X        "https://www.infinite-electronic.nz", [autotranslated product site]
     261+!!      "http://www.huri-translations.pf",
     262+        "https://admin.teara.govt.nz", {included: "http://blog.teara.govt.nz", 3/3, 0/3 [AS: teara.govt.nz]}
     263+!!        "https://tiritiowaitangi.govt.nz",
     264+        "http://www.tmoa.tki.org.nz",
     265+        "https://www.komako.org.nz", [no longer available, but crawled page had a paragraph in Maori with presumably a translation thereafter]
     266+        "http://www.wcl.govt.nz", {included: "http://kete.wcl.govt.nz" as wcl.govt.nz; 2/5 [first 3 misdetected: Tokelauan (American Samoa), Kiribati, Tongan], 0/3}       
     267+!!      "http://punareo.co.nz", [waiata]       
     268
     269+        "https://rapuatearatika.education.govt.nz",
     270+        "http://tmmkkm.school.nz",
     271X        "https://www.components-mart.nz",  [autotranslated product site]
     272+        "http://www.cs.waikato.ac.nz", [Te Taka's pages!]
     273+!!!        "http://www.kupengahao.co.nz", [MRI language books and resources]
     274+        "https://www.hapuhauora.health.nz", [Smokefree site. The one isMRI page has a proper sentence.]
     275X        "https://www.lcds-display.nz",  [autotranslated product site]       
     276+        "http://cms.sunsmartschools.co.nz", [as sunsmartschools.co.nz, e.g. http://sunsmartschools.co.nz/kura/how-to-become.html]   
     277+        "http://kuraproductions.co.nz",
     278+        "https://keepourmoneyclean.govt.nz", [1 page]
     279
     280+!!      "http://www.tekura.school.nz",
     281+        "http://www.tkkmmokopuna.school.nz", [e.g. http://www.tkkmmokopuna.school.nz/newsletter_sets/newsletters/13-wahanga-4-wiki-5-he-puna-korero]
     282+        "http://hangaraumatihiko.tki.org.nz", [govt website. e.g. http://hangaraumatihiko.tki.org.nz/te-tupuranga-tangata-me-te-rorohiko/whakatupuranga-3-te-whakatau-tikanga-kawe-i-nga-mahi-whakaoti-hopanga/]
     283+        "http://www.pakanae.maori.nz"
    287284    ],
    288285    "numPagesInMRICount" : 4360,
     
    290287}
    291288
     289
     290 96 sites detected as having isMRI pages - 7 sites/subdomains already included in the existing 96 = 89 sites.
     291
     292 -2.5* product sites -2 non-MRI sites with songlistings or forms etc
     293    *0.5 for e-agent.nz site
     294 = 84.5 sites total that at least contain MRI, most have pages inMRI.
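
The tally above can be checked with a short calculation. This is a minimal sketch: the variable names are illustrative, and the figures are copied from the note rather than recomputed from the site list.

```javascript
// Figures taken from the note; names are hypothetical.
const detectedSites = 96;   // sites detected as having isMRI pages
const alreadyIncluded = 7;  // sites/subdomains folded into other entries
const productSites = 2.5;   // product sites, with e-agent.nz counted as 0.5
const nonMriSites = 2;      // non-MRI sites with song listings, forms etc.

const distinctSites = detectedSites - alreadyIncluded;
const sitesContainingMri = distinctSites - productSites - nonMriSites;

console.log(distinctSites, sitesContainingMri); // prints: 89 84.5
```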
    292295----------------------------
    293296
     
    474477
    475478
     479----------------------------
     480
     481The remainder: 80 NZ sites detected as not containing pages InMRI, but with positive number of pages detected as containsMRI:
     482
     483db.Websites.aggregate([
     484    {
     485        $match: {
     486            $and: [
     487                {numPagesContainingMRI: {$gt: 0}},
     488                {numPagesInMRI: {$eq: 0}},
     489                {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]}
     490            ]
     491        }
     492    },
     493    { $unwind: "$geoLocationCountryCode" },
     494    {
     495        $group: {
     496            _id: "nz",
     497            count: { $sum: 1 },
     498            domain: { $addToSet: '$domain' },
     499            numPagesInMRICount: { $sum: '$numPagesInMRI' },
     500            numPagesContainingMRICount: { $sum: '$numPagesContainingMRI' }
     501        }
     502    },
     503    { $sort : { count : -1} }
     504]);
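
To make the selection logic of the aggregation above easier to follow, here is a minimal in-memory sketch in plain JavaScript. Only the field names come from the query; the sample documents and their domains are invented for illustration. It assumes geoLocationCountryCode is a single value per site, so the $unwind stage is left out.

```javascript
// Hypothetical sample documents; only the field names come from the query.
const websites = [
  { domain: "http://example-a.nz", geoLocationCountryCode: "NZ", numPagesInMRI: 0, numPagesContainingMRI: 3 },
  { domain: "http://example-b.com", geoLocationCountryCode: "NZ", numPagesInMRI: 0, numPagesContainingMRI: 2 },
  { domain: "http://example-c.nz", geoLocationCountryCode: "US", numPagesInMRI: 4, numPagesContainingMRI: 5 },
  { domain: "http://example-d.org", geoLocationCountryCode: "AU", numPagesInMRI: 0, numPagesContainingMRI: 1 },
];

// Same three conditions as the $match stage.
const matched = websites.filter(
  (w) =>
    w.numPagesContainingMRI > 0 &&
    w.numPagesInMRI === 0 &&
    (w.geoLocationCountryCode === "NZ" || /\.nz/.test(w.domain))
);

// Same accumulators as the $group stage, with the single group key "nz".
const summary = {
  _id: "nz",
  count: 0,
  domain: new Set(), // $addToSet keeps each domain once
  numPagesInMRICount: 0,
  numPagesContainingMRICount: 0,
};
for (const w of matched) {
  summary.count += 1;
  summary.domain.add(w.domain);
  summary.numPagesInMRICount += w.numPagesInMRI;
  summary.numPagesContainingMRICount += w.numPagesContainingMRI;
}

console.log(summary.count, [...summary.domain], summary.numPagesContainingMRICount);
```

Note that the pattern /\.nz/ is not anchored, so it matches ".nz" anywhere in the domain string. Whether that is intended is not stated in the changeset.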
     505
     506
     507Find pages for testing with:
     508    db.getCollection('Webpages').find({URL:/ashtangatauranga\.co\.nz/, containsMRI: true, mriSentenceCount: {$gt: 0}})
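
The same lookup can be repeated for other sites by building the filter from a host name. The sketch below is an assumption about how one might do this, not part of the changeset: escapeRegExp and buildMriPageFilter are hypothetical helpers, and only the field names URL, containsMRI and mriSentenceCount come from the query above.

```javascript
// Escape regex metacharacters so that dots in the host match literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: builds the same filter as the query above for any host.
function buildMriPageFilter(host) {
  return {
    URL: new RegExp(escapeRegExp(host)),
    containsMRI: true,
    mriSentenceCount: { $gt: 0 },
  };
}

const filter = buildMriPageFilter("ashtangatauranga.co.nz");
console.log(filter.URL.source);
```

In the mongo shell the resulting object could then be passed to db.getCollection('Webpages').find(filter).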
     509
     510
     511/* 1 */
     512{
     513    "_id" : "nz",
     514    "count" : 80.0,
     515    "domain" : [
     516X        "http://www.zoomin.co.nz", [map site, so placenames]
     517X        "http://www.biketorqueyamaha.co.nz", [placenames] {includes "http://biketorqueyamaha.co.nz"}
     518X        "http://archerpix.com", [photo captions containing placenames]
     519X        "http://philipbeadle.co.nz", [art captions containing placenames]
     520X        "https://2019.nethui.nz", [Just MRI words in ENG sentences]
     521X        "http://crimson.co.nz", [address]
     522+        "http://holyspirit.nz", (e.g. https://holyspirit.nz/wp-content/uploads/02-25-2018-Newsletter.pdf)
     523X        "https://www.wingspan.co.nz", [1 page, 1 conjoined word, looks like placename]
     524X        "http://nzpostcard.co.nz", [postcards with placenames]
     525+        "https://www.ngamanawainc.co.nz", (e.g. https://www.ngamanawainc.co.nz/history/id/60) {includes "http://www.ngamanawainc.co.nz"}
     526
     527+        "http://www.finlaysonpark.school.nz", [e.g. http://www.finlaysonpark.school.nz/58/pages/61-team-overview; Some Tongan language pages and Samoan pages]
     528X        "http://artizani.co.nz", [address]
     529+        "http://www.w3vietnam.org.nz", [e.g. http://www.w3vietnam.org.nz/w3%20mihi%20ki%20ka%20hoia.htm] (includes "http://w3vietnam.org.nz")
     530X        "https://sooty.nz", [names, war death notices, place names]
     531X?        "http://rakaumanga.school.nz", [School has a longer Maori title name, but that's it] {includes "http://www.rakaumanga.school.nz"}
     532X        "http://www.rotoruanz.com", [e.g. https://www.rotoruanz.com/RNZ/media/Media-Library/SOI-Board-version-08-June-2018-final-published-on-website.pdf]
     533X        "https://www.cruisetourstauranga.co.nz", [English with Tauranga being MRI placename]
     534X        "http://www.jeremybaker.nz", [one word, HOkio]
     535
     536X        "https://liveresults.co.nz", [canoe sports team names]
     537X        "http://rexedra.gen.nz", [ENG sentence with MRI words]
     538+        "https://www.takitimu.ac.nz", [school of performing arts, example MRI sentence at https://www.takitimu.ac.nz/about-us]
     539X        "http://www.electionresults.govt.nz", [placenames and misdetection of Return to Home Page] {includes https://www.electionresults.org.nz which seems to be an alias for the same} {includes "http://archive.electionresults.govt.nz"}
     540+        "https://kotahimiriona.co.nz", (e.g. https://kotahimiriona.co.nz/about-the-app/)
     541+        "https://rehuamarae.co.nz", (e.g. https://rehuamarae.co.nz/)
     542+        "http://reoora.co.nz", (e.g. https://reoora.co.nz/about/)
     543
     544X        "http://otorohanga.directorybusiness.co.nz", [placenames]
     545X        "http://waitarahistory.org.nz", [placenames and misdetection of sentences of the form "31, No 262." as MRI]
     546+        "https://manawatuheritage.pncc.govt.nz", (e.g. https://manawatuheritage.pncc.govt.nz/about)
     547+        "http://rsnz.natlib.govt.nz", [e.g. http://rsnz.natlib.govt.nz/volume/rsnz_26/rsnz_26_00_004560.html] NOTE: site appears to use Greenstone
     548X        "https://www.rotorua-rafting.co.nz", [placenames]
     549+        "https://www.taitokerautrust.org.nz", (e.g. https://www.taitokerautrust.org.nz/)
     550+        "http://tewikiotereomaori.nz", (e.g. https://tewikiotereomaori.nz/about/)
     551+        "https://www.korokikahukura.co.nz", (e.g. https://www.korokikahukura.co.nz/ko-wai-m257tou.html about the connection to the Waikato River)
     552
     553X        "https://www.puhaandpakeha.co.nz", [ENG sentences with MRI words]
     554X        "http://myfathersworld.net.nz", [placenames]
     555X        "https://www.ashtangatauranga.co.nz", [misdetection]
     556+        "https://www.pinterest.nz", (e.g. sentence "E tu Whare Rangi, e tu Whare Pataka, kia ora koutou nga Whare Tipuna o te Ao Maori." at https://www.pinterest.nz/amp/marylizw03/maori/)
     557+        "https://www.rereahu.maori.nz", (e.g. https://www.rereahu.maori.nz/uploads/72717/files/168860/Te_Maru_o_Rereahu_Newsletter_No_5_February_2011.pdf)
     558+        "http://givealittle.co.nz", (e.g. https://givealittle.co.nz/cause/rebuild-tapu-te-ranga-marae which contains greetings phrases and sentence "Nā te ringa tangata i hanga te whare Nā te tuarā o te whare i whakatipu i te tangata")
     559X        "http://www.gans.co.nz", [placenames]
     560+        "https://kaiiwicamp.nz", [placenames] {includes "http://kaiiwicamp.nz"}
     561+        "http://ngarauhuia.ngatiapakiterato.iwi.nz", (e.g. http://ngarauhuia.ngatiapakiterato.iwi.nz/pdf/Nga-Rau-Huia_Raumati_2015_Issue1.pdf)
     562+        "https://m.wairarapatv.co.nz", (e.g. https://m.wairarapatv.co.nz/archive/i/JGmJqCZCNK8/te-ara-whanui-kura-kaupapa-maori-o-nga-khanga-reo-o-te-awa-kairangi)
     563
     564X        "http://www.methodist.org.nz", [ENG sentence with MRI words]
     565+        "http://avonside.net", (e.g. http://avonside.net/Nicky/indexmain.htm)
     566X        "http://www.ruralfind.co.nz", [placenames]
     567+        "http://www.maoriinvestments.co.nz", (e.g. Vision, Mission & Values tab of https://maoriinvestments.co.nz/about/organisation)
     568+        "http://conference.tpwt.maori.nz", (e.g. https://www.tpwt.maori.nz/te-mahe-matauranga/)
     569+        "https://www.puau.school.nz", (e.g. https://www.puau.school.nz/k%C4%81inga-home)
     570+?        "http://ngatiwhakaue.iwi.nz", (e.g. greetings at http://ngatiwhakaue.iwi.nz/)
     571X        "http://www.nzpcn.org.nz", [lots of misdetections and one sentence in English with MRI words, e.g. "Kaori is a Taonga to Maori".] {includes "http://dev.nzpcn.org.nz"}
     572+?        "https://interactives.stuff.co.nz", [2 x page mostly empty except for the title "TE WIKI O TE REO MĀORI Māori"]
     573+        "http://tehauora.org.nz", (e.g. http://tehauora.org.nz/ and proverb at bottom of http://tehauora.org.nz/about-us)
     574
     575+        "http://temahurehure.maori.nz", (e.g. greeting http://temahurehure.maori.nz/site/file/temp/hirebook.pdf)
     576X        "http://pukekohe.directorybusiness.co.nz", [placenames]
     577+!!      "http://www.temarareo.org", (dictionary with sample sentences. e.g. http://www.temarareo.org/PAPAKUPU/dictionary-searchresults/A-2.htm)
     578X        "https://www.tasteofplenty.co.nz", [Tauranga placename and misdetection of "No one went away hungry"] {includes "http://www.tasteofplenty.co.nz"}
     579     
     580+        "http://www.tetaumuturunanga.iwi.nz", (e.g. http://www.tetaumuturunanga.iwi.nz/wp-content/uploads/2016/04/Malvern-Cultural-Narrative-Draft.pdf)
     581       
     582       
     583X        "https://www.blushandbrows.nz", [misdetection of "Makeup..."]
     584X        "http://talkingtothecan.com", [misdetection of things like "22, no." and mistaking ENG sentences with MRI words]
     585
     586+?       "http://whatonga.school.nz", [school title]
     587+?       "https://player.vimeo.com", [Video titles contain MRI sentences and MRI titles interspersed in ENG sentence. Video's audio content may be in MRI]
     588+        "http://www.writersfestival.co.nz", (e.g. http://www.writersfestival.co.nz/news/Page1/urgently-relevant-novel-wins-countrys-richest-literary-award/)
     589+?       "http://southerntribes.co.nz", [Brazilian jiu-jitsu site with a greeting in MRI on main page]
     590+        "http://www.kmk.maori.nz", (e.g. http://www.kmk.maori.nz/kmk-events)
     591+        "https://www.stats.govt.nz", (e.g. http://archive.stats.govt.nz/Census/2001-census-data/2001-census-pacific-profiles/cook-island-maori-people-in-new-zealand.aspx)
     592X        "http://www.eventcinemas.co.nz", [placenames and misdetection of ENG "Atura Hotels"]
     593X        "https://www.zenbu.co.nz" [misdetection and NZ school addresses]
     594    ],
     595    "numPagesInMRICount" : 0,
     596    "numPagesContainingMRICount" : 1673
     597}