Changeset 33890
- Timestamp: 2020-02-03T20:31:33+13:00
- File: 1 edited
other-projects/maori-lang-detection/mongodb-data/6table_nonProductSites1_manualShortlist.json
r33884 r33890
   199    199  "http://pukapuka.nz", 3/3 1/4 [lorem ipsum used on first 3 pages]
   200    200  "http://community.nzdl.org", 3/3 0/3 [containsMRI has detected Te Taka Keegan as MRI sentence]
   201         !! "http://kmpmusic.co.nz", 0-4/4? [but CD listing of some MRI song titles] 0 [no other pages containsMRI]
   202         "http://maori.livingheritage.org.nz", 2/2 2/2
          201  X!! "http://kmpmusic.co.nz", 0-4/4? [but CD listing of some MRI album and song titles] 0 [no other pages containsMRI]
          202  "http://maori.livingheritage.org.nz", 2/2 2/2 {includes: http://www.livingheritage.org.nz}
   203    203  "http://pukoro.co.nz", 2/2 0/2
   204         "https://register.tpota.org.nz", 0/1 [form] 0/2
   205         X "https://cdn.tehiku.nz" => DOMAIN: "tehiku.nz", 0/4, 1/3 [but audio content may be in MRI]
          204  X "https://register.tpota.org.nz", 0/1 [form] 0/2
          205  + "https://cdn.tehiku.nz" => DOMAIN: "tehiku.nz", 0/4, 1/3 [but audio content may be in MRI] But there are pages containing MRI to be found by non-random sampling of tehiku.nz, e.g. https://tehiku.nz/te-hiku-radio/te-tangihanga-ki-a-erima-henare/ contains MRI sentences
   206    206  !! "http://www.runanga.co.nz", 3/3 0 [no containsMRI outside isMRI pages]
   207    207  ! "http://kuraaiwi.maori.nz", 2/4 [navigation only downloaded. But site content checked] 2/3
     …      …
   211    211  "http://teaohou.natlib.govt.nz", 4/4, 2/4
   212    212  "http://www.tuwharetoa.iwi.nz", 2/3 0/3
   213         X "http://auturoa.nz", 0/4 0/3 [lots of MRI terms among English] - COMMUNITY
          213  + "http://auturoa.nz", 0/4 0/3 [lots of MRI terms among English] - COMMUNITY (But there are pages inMRI to be found by non-random sampling, e.g. http://auturoa.nz/KarakiaMoKuaToRangiTeRaa.html)
   214    214  "https://www.terito.school.nz", 3/3, 0/2 total
   215    215  "https://ttw1.cwp.govt.nz", 3/3 3/3
     …      …
   228    228
   229    229  "http://anglicanprayerbook.nz", 3/3 3/3
   230         "http://arataua.nz", 4/4, 2/3
   231         "http://blog.teara.govt.nz", 3/3, 0/3 [AS: teara.govt.nz]
          230  "http://arataua.nz", 4/4, 2/3
   232    231  "http://maori.tki.org.nz", 3/3 3/3
   233    232  DONE (with/out www): "http://www.firstworldwar.tki.org.nz",
     …      …
   236    235  "https://haereheikaiako.co.nz", 1/1, 0 [no containsMRI outside isMRI pages]
   237    236  "https://curriculumtool.education.govt.nz", 4/4, 3/3
   238         "http://kurakokiri.maori.nz", 3/3, 3/3 [same nav menus on each page]
   239         "http://kete.wcl.govt.nz", 2/5 [first 3 misdetected: Tokelauan (American Samoa), Kiribati, Tongan], 0/3
          237  "http://kurakokiri.maori.nz", 3/3, 3/3 [same nav menus on each page] {includes: "http://www.kurakokiri.maori.nz"}
   240    238  "http://www.kkmmaungarongo.co.nz", 3/3, 3/3
   241    239  "http://www.heartland.co.nz", 3/3, 1/1 total
   242    240  "http://oilcrash.com", 2/2 total, 0/3
   243         "http://www.kura-porirua.school.nz", 4/4, 2/3
   244         "http://videos.e-agent.nz", [AT: e-agent.nz] 3/3, 3/3 [repeated nav]
          241  "http://www.kura-porirua.school.nz", 4/4, 2/3
   245    242  "https://www.sporty.co.nz", 3/3, 0 [no containsMRI outside isMRI pages]
   246    243  "https://www.tematawai.maori.nz", 3/3, 3/3
   247    244
   248         "https://www.terakipaewhenua.school.nz",
   249         "http://www.tetaurawhiri.govt.nz",
   250         "http://archive.stats.govt.nz",
   251         "http://tiritiowaitangi.govt.nz",
   252         "http://www.waiata.maori.nz",
   253         "http://hana.co.nz",
   254         "http://kaupare.co.nz",
   255         "http://www.tereowrap.nz",
   256         "https://www.e-agent.nz",
   257         "http://www.hrc.co.nz",
   258         "http://ngatiporoukiponeke.org.nz",
   259         "http://rurued.school.nz",
   260         "http://www.twtop.school.nz",
   261         "https://www.infinite-electronic.nz",
   262         "http://www.huri-translations.pf",
   263         "https://admin.teara.govt.nz",
   264         "https://tiritiowaitangi.govt.nz",
   265         "http://www.tmoa.tki.org.nz",
   266         "https://www.komako.org.nz",
   267         "http://www.wcl.govt.nz",
   268         "https://office.e-agent.nz",
   269         "http://punareo.co.nz",
   270         "http://www.kurakokiri.maori.nz",
   271         "https://rapuatearatika.education.govt.nz",
   272         "http://tmmkkm.school.nz",
   273         "https://www.components-mart.nz",
   274         "http://www.cs.waikato.ac.nz",
   275         "http://www.kupengahao.co.nz",
   276         "https://www.hapuhauora.health.nz",
   277         "https://www.lcds-display.nz",
   278         "http://waiata.maori.nz",
   279         "http://cms.sunsmartschools.co.nz",
   280         "http://www.livingheritage.org.nz",
   281         "http://kuraproductions.co.nz",
   282         "https://keepourmoneyclean.govt.nz",
   283
   284         "http://www.tkkmmokopuna.school.nz",
   285         "http://hangaraumatihiko.tki.org.nz",
   286         "http://www.pakanae.maori.nz"
          245  + "https://www.terakipaewhenua.school.nz",
          246  + "http://www.tetaurawhiri.govt.nz",
          247  + "http://archive.stats.govt.nz", (1 page isMRI)
          248  + "http://tiritiowaitangi.govt.nz",
          249  +!! "http://www.waiata.maori.nz", {includes: "http://waiata.maori.nz"}
          250  + "http://hana.co.nz", [crawled version of page contains MRI sentences, but current page is labelled for mobiles whereas in browser it just has a picture]
          251  + "http://kaupare.co.nz",
          252  + "http://www.tereowrap.nz",
          253  ?X "https://www.e-agent.nz", [autotranslated? SEO related site in NZ, Chinese, English and MRI] {includes: "https://office.e-agent.nz"}
          254  { included: "http://videos.e-agent.nz", [AT: e-agent.nz] 3/3, 3/3 [repeated nav] }
          255  + "http://www.hrc.co.nz",
          256  + "http://ngatiporoukiponeke.org.nz",
          257
          258  + "http://rurued.school.nz",
          259  + "http://www.twtop.school.nz",
          260  X "https://www.infinite-electronic.nz", [autotranslated product site]
          261  +!! "http://www.huri-translations.pf",
          262  + "https://admin.teara.govt.nz", {included: "http://blog.teara.govt.nz", 3/3, 0/3 [AS: teara.govt.nz]}
          263  +!! "https://tiritiowaitangi.govt.nz",
          264  + "http://www.tmoa.tki.org.nz",
          265  + "https://www.komako.org.nz", [no longer available, but crawled page had a paragraph in Maori with presumably a translation thereafter]
          266  + "http://www.wcl.govt.nz", {included: "http://kete.wcl.govt.nz" as wcl.govt.nz; 2/5 [first 3 misdetected: Tokelauan (American Samoa), Kiribati, Tongan], 0/3}
          267  +!! "http://punareo.co.nz", [waiata]
          268
          269  + "https://rapuatearatika.education.govt.nz",
          270  + "http://tmmkkm.school.nz",
          271  X "https://www.components-mart.nz", [autotranslated product site]
          272  + "http://www.cs.waikato.ac.nz", [Te Taka's pages!]
          273  +!!! "http://www.kupengahao.co.nz", [MRI language books and resources]
          274  + "https://www.hapuhauora.health.nz", [Smokefree site. The one isMRI page has a proper sentence.]
          275  X "https://www.lcds-display.nz", [autotranslated product site]
          276  + "http://cms.sunsmartschools.co.nz", [as sunsmartschools.co.nz, e.g. http://sunsmartschools.co.nz/kura/how-to-become.html]
          277  + "http://kuraproductions.co.nz",
          278  + "https://keepourmoneyclean.govt.nz", [1 page]
          279
          280  +!! "http://www.tekura.school.nz",
          281  + "http://www.tkkmmokopuna.school.nz", [e.g. http://www.tkkmmokopuna.school.nz/newsletter_sets/newsletters/13-wahanga-4-wiki-5-he-puna-korero]
          282  + "http://hangaraumatihiko.tki.org.nz", [govt website. e.g. http://hangaraumatihiko.tki.org.nz/te-tupuranga-tangata-me-te-rorohiko/whakatupuranga-3-te-whakatau-tikanga-kawe-i-nga-mahi-whakaoti-hopanga/]
          283  + "http://www.pakanae.maori.nz"
   287    284  ],
   288    285  "numPagesInMRICount" : 4360,
     …      …
   290    287  }
   291    288
          289
          290  96 sites detected as having isMRI pages - 7 sites/subdomains already included in the existing 96 = 89 sites.
          291
          292  -2.5* product sites -2 non-MRI sites with songlistings or forms etc
          293  *0.5 for e-agent.nz site
          294  = 84.5 sites total that at least contain MRI, most have pages inMRI.
   292    295  ----------------------------
     …      …
   474    477
   475    478
          479  ----------------------------
          480
          481  The remainder: 80 NZ sites detected as not containing pages InMRI, but with positive number of pages detected as containsMRI:
          482
          483  db.Websites.aggregate([
          484      {
          485          $match: {
          486              $and: [
          487                  {numPagesContainingMRI: {$gt: 0}},
          488                  {numPagesInMRI: {$eq: 0}},
          489                  {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]}
          490              ]
          491          }
          492      },
          493      { $unwind: "$geoLocationCountryCode" },
          494      {
          495          $group: {
          496              _id: "nz",
          497              count: { $sum: 1 },
          498              domain: { $addToSet: '$domain' },
          499              numPagesInMRICount: { $sum: '$numPagesInMRI' },
          500              numPagesContainingMRICount: { $sum: '$numPagesContainingMRI' }
          501          }
          502      },
          503      { $sort : { count : -1} }
          504  ]);
          505
          506
          507  Find pages for testing with:
          508  db.getCollection('Webpages').find({URL:/ashtangatauranga\.co\.nz/, containsMRI: true, mriSentenceCount: {$gt: 0}})
          509
          510
          511  /* 1 */
          512  {
          513  "_id" : "nz",
          514  "count" : 80.0,
          515  "domain" : [
          516  X "http://www.zoomin.co.nz", [map site, so placenames]
          517  X "http://www.biketorqueyamaha.co.nz", [placenames] {includes "http://biketorqueyamaha.co.nz"}
          518  X "http://archerpix.com", [photo captions containing placenames]
          519  X "http://philipbeadle.co.nz", [art captions containing placenames]
          520  X "https://2019.nethui.nz", [Just MRI words in ENG sentences]
          521  X "http://crimson.co.nz", [address]
          522  + "http://holyspirit.nz", (e.g. https://holyspirit.nz/wp-content/uploads/02-25-2018-Newsletter.pdf)
          523  X "https://www.wingspan.co.nz", [1 page, 1 conjoined word, looks like placename]
          524  X "http://nzpostcard.co.nz", [postcards with placenames]
          525  + "https://www.ngamanawainc.co.nz", (e.g. https://www.ngamanawainc.co.nz/history/id/60) {includes "http://www.ngamanawainc.co.nz"}
          526
          527  + "http://www.finlaysonpark.school.nz", [e.g. http://www.finlaysonpark.school.nz/58/pages/61-team-overview; Some Tongan language pages and Samoan pages]
          528  X "http://artizani.co.nz", [address]
          529  + "http://www.w3vietnam.org.nz", [e.g. http://www.w3vietnam.org.nz/w3%20mihi%20ki%20ka%20hoia.htm] (includes "http://w3vietnam.org.nz")
          530  X "https://sooty.nz", [names, war death notices, place names]
          531  X? "http://rakaumanga.school.nz", [School has a longer Maori title name, but that's it] {includes "http://www.rakaumanga.school.nz"}
          532  X "http://www.rotoruanz.com", [e.g. https://www.rotoruanz.com/RNZ/media/Media-Library/SOI-Board-version-08-June-2018-final-published-on-website.pdf]
          533  X "https://www.cruisetourstauranga.co.nz", [English with Tauranga being MRI placename]
          534  X "http://www.jeremybaker.nz", [one word, HOkio]
          535
          536  X "https://liveresults.co.nz", [canoe sports team names]
          537  X "http://rexedra.gen.nz", [ENG sentence with MRI words]
          538  + "https://www.takitimu.ac.nz", [school of performing arts, example MRI sentence at https://www.takitimu.ac.nz/about-us]
          539  X "http://www.electionresults.govt.nz", [placenames and misdetection of Return to Home Page] {includes https://www.electionresults.org.nz which seems to be an alias for the same} {includes "http://archive.electionresults.govt.nz"}
          540  + "https://kotahimiriona.co.nz", e.g. (https://kotahimiriona.co.nz/about-the-app/)
          541  + "https://rehuamarae.co.nz", (e.g. https://rehuamarae.co.nz/)
          542  + "http://reoora.co.nz", (e.g. https://reoora.co.nz/about/)
          543
          544  X "http://otorohanga.directorybusiness.co.nz", [placenames]
          545  X "http://waitarahistory.org.nz", [placenames and misdetection of sentences of the form "31, No 262." as MRI]
          546  + "https://manawatuheritage.pncc.govt.nz", (e.g. https://manawatuheritage.pncc.govt.nz/about)
          547  + "http://rsnz.natlib.govt.nz", [e.g. http://rsnz.natlib.govt.nz/volume/rsnz_26/rsnz_26_00_004560.html] NOTE: site appears to use Greenstone
          548  X "https://www.rotorua-rafting.co.nz", [placenames]
          549  + "https://www.taitokerautrust.org.nz", (e.g. https://www.taitokerautrust.org.nz/)
          550  + "http://tewikiotereomaori.nz", (e.g. https://tewikiotereomaori.nz/about/)
          551  + "https://www.korokikahukura.co.nz", (e.g. https://www.korokikahukura.co.nz/ko-wai-m257tou.html about the connection to the Waikato River)
          552
          553  X "https://www.puhaandpakeha.co.nz", [ENG sentences with MRI words]
          554  X "http://myfathersworld.net.nz", [placenames]
          555  X "https://www.ashtangatauranga.co.nz", [misdetection]
          556  + "https://www.pinterest.nz", (e.g. sentence "E tu Whare Rangi, e tu Whare Pataka, kia ora koutou nga Whare Tipuna o te Ao Maori." at https://www.pinterest.nz/amp/marylizw03/maori/)
          557  + "https://www.rereahu.maori.nz", (e.g. https://www.rereahu.maori.nz/uploads/72717/files/168860/Te_Maru_o_Rereahu_Newsletter_No_5_February_2011.pdf)
          558  + "http://givealittle.co.nz", (e.g. https://givealittle.co.nz/cause/rebuild-tapu-te-ranga-marae which contains greetings phrases and sentence ""Nā te ringa tangata i hanga te whare Nā te tuarā o te whare i whakatipu i te tangata")
          559  X "http://www.gans.co.nz", [placenames]
          560  + "https://kaiiwicamp.nz", [placenames] {includes "http://kaiiwicamp.nz"}
          561  + "http://ngarauhuia.ngatiapakiterato.iwi.nz", (e.g. http://ngarauhuia.ngatiapakiterato.iwi.nz/pdf/Nga-Rau-Huia_Raumati_2015_Issue1.pdf)
          562  + "https://m.wairarapatv.co.nz", (e.g. https://m.wairarapatv.co.nz/archive/i/JGmJqCZCNK8/te-ara-whanui-kura-kaupapa-maori-o-nga-khanga-reo-o-te-awa-kairangi)
          563
          564  X "http://www.methodist.org.nz", [ENG sentence with MRI words]
          565  + "http://avonside.net", (e.g. http://avonside.net/Nicky/indexmain.htm)
          566  X "http://www.ruralfind.co.nz", [placenames]
          567  + "http://www.maoriinvestments.co.nz", (e.g. Vision, Mission & Values tab of https://maoriinvestments.co.nz/about/organisation)
          568  + "http://conference.tpwt.maori.nz", (e.g. https://www.tpwt.maori.nz/te-mahe-matauranga/)
          569  + "https://www.puau.school.nz", (e.g. https://www.puau.school.nz/k%C4%81inga-home)
          570  +? "http://ngatiwhakaue.iwi.nz", (e.g. greetings at http://ngatiwhakaue.iwi.nz/)
          571  X "http://www.nzpcn.org.nz", [lots of misdetections and one sentence in English with MRI words, e.g. "Kaori is a Taonga to Maori".] {includes "http://dev.nzpcn.org.nz"}
          572  +? "https://interactives.stuff.co.nz", [2 x page mostly empty except for the title "TE WIKI O TE REO MĀORI Māori"]
          573  + "http://tehauora.org.nz", (e.g. http://tehauora.org.nz/ and proverb at bottom of http://tehauora.org.nz/about-us)
          574
          575  + "http://temahurehure.maori.nz", (e.g. greeting http://temahurehure.maori.nz/site/file/temp/hirebook.pdf)
          576  X "http://pukekohe.directorybusiness.co.nz", [placenames]
          577  +!! "http://www.temarareo.org", (dictionary with sample sentences. e.g. http://www.temarareo.org/PAPAKUPU/dictionary-searchresults/A-2.htm)
          578  X "https://www.tasteofplenty.co.nz", [Tauranga placename and misdetection of "No one went away hungry] {includes "http://www.tasteofplenty.co.nz"}
          579
          580  + "http://www.tetaumuturunanga.iwi.nz", (e.g. http://www.tetaumuturunanga.iwi.nz/wp-content/uploads/2016/04/Malvern-Cultural-Narrative-Draft.pdf)
          581
          582
          583  X "https://www.blushandbrows.nz", [misdetection of "Makeup..."]
          584  X "http://talkingtothecan.com", [misdetection of things like ""22, no." and mistaking ENG sentences with MRI words]
          585
          586  +? "http://whatonga.school.nz", [school title]
          587  +? "https://player.vimeo.com", [Video titles contain MRI sentences and MRI titles interspersed in ENG sentence. Video's audio content may be in MRI]
          588  + "http://www.writersfestival.co.nz", (e.g. http://www.writersfestival.co.nz/news/Page1/urgently-relevant-novel-wins-countrys-richest-literary-award/)
          589  +? "http://southerntribes.co.nz", [brazilian jiu jitsu site with a greeting in MRI on main page]
          590  + "http://www.kmk.maori.nz", (e.g. http://www.kmk.maori.nz/kmk-events)
          591  + "https://www.stats.govt.nz", (e.g. http://archive.stats.govt.nz/Census/2001-census-data/2001-census-pacific-profiles/cook-island-maori-people-in-new-zealand.aspx)
          592  X "http://www.eventcinemas.co.nz", [placenames and misdetection of ENG "Atura Hotels"]
          593  X "https://www.zenbu.co.nz" [misdetection and NZ school addresses]
          594  ],
          595  "numPagesInMRICount" : 0,
          596  "numPagesContainingMRICount" : 1673
          597  }
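
The selection logic of the "remainder" aggregation can be checked without a database. The following is a minimal sketch in plain JavaScript, assuming `geoLocationCountryCode` is stored as a single string; the helper names `matchesRemainder` and `summarise` and the sample records are hypothetical and are not taken from the crawl data. It mirrors the `$match`, `$unwind` and `$group` stages on an in-memory array.

```javascript
// Sketch only: mirrors the $match / $unwind / $group stages on plain objects.
// Helper names and sample records are invented for illustration.

// $match: some containsMRI pages, no isMRI pages, and NZ by geolocation or by domain.
function matchesRemainder(site) {
  const isNZ = site.geoLocationCountryCode === "NZ" || /\.nz/.test(site.domain);
  return site.numPagesContainingMRI > 0 && site.numPagesInMRI === 0 && isNZ;
}

function summarise(sites) {
  const kept = sites
    .filter(matchesRemainder)
    // $unwind, with default options, emits nothing for a missing or null field.
    .filter((site) => site.geoLocationCountryCode != null);

  // $group with a constant _id: one summary document over all kept sites.
  return {
    _id: "nz",
    count: kept.length,
    domain: [...new Set(kept.map((site) => site.domain))], // like $addToSet
    numPagesInMRICount: kept.reduce((n, s) => n + s.numPagesInMRI, 0),
    numPagesContainingMRICount: kept.reduce((n, s) => n + s.numPagesContainingMRI, 0),
  };
}

const sampleSites = [
  { domain: "http://example-a.nz", geoLocationCountryCode: "NZ", numPagesContainingMRI: 3, numPagesInMRI: 0 },
  { domain: "http://example-b.com", geoLocationCountryCode: "NZ", numPagesContainingMRI: 2, numPagesInMRI: 0 },
  // Excluded by $match: has isMRI pages.
  { domain: "http://example-c.nz", geoLocationCountryCode: "US", numPagesContainingMRI: 1, numPagesInMRI: 4 },
  // Excluded by $match: neither NZ geolocation nor a .nz domain.
  { domain: "http://example-d.com", geoLocationCountryCode: "US", numPagesContainingMRI: 5, numPagesInMRI: 0 },
  // Passes $match via the domain pattern, but would be dropped by $unwind (no country code).
  { domain: "http://example-e.nz", numPagesContainingMRI: 7, numPagesInMRI: 0 },
];

const result = summarise(sampleSites);
console.log(result);
```

One point the sketch makes visible: a site that matches only through the `.nz` domain pattern and has no `geoLocationCountryCode` passes `$match` but is not counted after `$unwind`. Whether any such records exist in the Websites collection is not stated in the changeset.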