source: other-projects/maori-lang-detection/mongodb-data/6table_nonProductSites1_manualShortlist.json

Last change on this file was 33891, checked in by ak19, 4 years ago

Site level detected vs manual inspected data: working shown in file ManualShortlisting.txt, and summarised in LibreOffice table TableOfNumDetectedVsManualSITESWithMRI.ods and image 8table_siteCountSummary.png

File size: 29.7 KB
/*

db.Websites.aggregate([
    {
        $match: {
            $and: [
                {numPagesInMRI: {$gt: 0}},
                {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]}
            ]
        }
    },
    { $unwind: "$geoLocationCountryCode" },
    {
        $group: {
            _id: "nz",
            count: { $sum: 1 },
            domain: { $addToSet: '$domain' },
            numPagesInMRICount: { $sum: '$numPagesInMRI' },
            numPagesContainingMRICount: { $sum: '$numPagesContainingMRI' }
        }
    },
    { $sort : { count : -1} }
]);

For sites originating in NZ or with a .nz TLD, none of the URLs were manually inspected and all URLs are accepted.

For all but NZ, get final column results with:
    db.getCollection('Websites').find({domain:/coggle\.it/})
And can check for URLs with:
    db.getCollection('Webpages').find({URL: /coggle\.it/, isMRI: true})


NOTES:
1. DE:

"de","2.0","0+1","9+35 misdetected", http://www.cartogiraffe.com, https://www.cartogiraffe.com,
Ought to be 2+2 numPagesInMRICount and 9+2 numPagesContainingMRICount:
- Both cartogiraffe.com pages were identical and consisted mostly of MRI sentences, with one name not being MRI. So isMRI should have been true for both pages.
- Only one of the 2 MRI translations of the Universal Declaration of Human Rights at http://www.udhr.de got downloaded. A total of 75 pages were downloaded, but more translated pages appeared to be on the webpage. Not sure why the crawl had a _SUCCESS file to indicate a completed download.
- Then http://www.udhr.de had 35-1 non-MRI language translations of the Universal Declaration of Human Rights where one or more sentences were misdetected as MRI. With the additional MRI page that didn't get downloaded, there should be 9+2 = 11 pages containing MRI.

So instead of
"de","2.0","1","44", http://www.cartogiraffe.com, https://www.cartogiraffe.com, http://www.udhr.de
the row should be
"de","2.0","4","11", http://www.cartogiraffe.com, https://www.cartogiraffe.com, http://www.udhr.de


"au","3.0",7+0+1,83+1+3,https://www.kiwiproperty.com, https://infogram.com/te-marautanga-o-aotearoa-moe-pld-allocations-2012-1go502ygvn562jd,https://koreromaori.com

2. US:
aclhokiangarocks.blogspot.com contains at least one page with MRI paragraphs. See http://aclhokiangarocks.blogspot.com/feeds/posts/default under section "Nga Tuhinga o tatou Tupuna"
Although this page was crawled by Nutch, the blog presented the contents in a complex way and therefore the text wasn't retrieved here. See also the dedicated page this text should have been in: http://aclhokiangarocks.blogspot.com/2012/05/nga-tuhinga-o-tatou-tupuna.html

"_id","siteCount","numPagesInMRICount","numPagesContainingMRICount","URLs of pages detected as inMRI"
"nz","176.0" containsMRI sites vs 96 isMRI sites,"4360","9641" in the 176 containsMRI sites vs 7968 in the 96 isMRI sites
"us","29.0",
    1+2+0+0+4+166+0+39 +257+2+21+12+25+13+53+0+1+0+1+11 +32+37+4 +0+0+0 = 681,
    31+2+2+20+58+166+3+91 +258+2+25+12+66+22+53+6+1+1+2+10 +58+54+6 +1+2+1 = 953,
    anglicanhistory.org,unicode.org,static-promote.weebly.com,aclhokiangarocks.blogspot.com,bahaiprayers.net,biblehub.com,muhammad.com,godrules.net,m.biblepub.com, krassotkin.ru,gotquestions.org,
    maorinews.com,maaori.com,kiaorahola.blogspot.com,kjohnsonnz.blogspot.com,pumanawawhangara.blogspot.com,dannykahei.tripod.com,burkekm001.tripod.com,tkkpipipaopao.blogspot.com, manateina.blogspot.com,
    tatai09.blogspot.com,twttoa.com,tuhua2010.blogspot.com,
    breaker.audio,drive.google.com/file/d/1NwuzafjddaP8gxI7O_Zapts5bM7mrtwn/preview,in.pinterest.com/pin/317363104978423418/
"au","2.0","8","86", https://www.kiwiproperty.com, https://koreromaori.com
"de","2.0","4","11", http://www.cartogiraffe.com, https://www.cartogiraffe.com, http://www.udhr.de
"dk","2.0","4","7", *.ngapuhitelevision.com, *.ngapuhiradio.com
"bg","1.0","2","2", http://anitra.net/activism/humanrights/UDHR/mbf_print.htm, http://anitra.net/activism/humanrights/UDHR/rrt_print.htm
"cz","1.0","0","1", http://www.henryklahola.nazory.cz/094.Maori.htm, http://henryklahola.nazory.cz/094.Maori.htm
"es","1.0","1","1", https://www.uv.es/~pla/red.net/intmaori.html
"fr","1.0","1","1", http://chantsdeluttes.free.fr/versionsinter/page%20maori.html
"ie","1.0","1","3", https://coggle.it/diagram/WSYB0mLA2QABD5BH/t/ko-au-ko-koe


--------------

    https://www.statisticshowto.datasciencecentral.com/probability-and-statistics/find-sample-size/#CI1
    https://stats.stackexchange.com/questions/207584/sample-size-choice-with-binary-outcome
    https://www.statisticshowto.datasciencecentral.com/z-alpha2-za2/

    N (NZ pages where isMRI comes out true) = 4360
    solving for n, the sample size
    confidence level = 90%
    m, margin of error = 5%

    From the "z alpha/2" table, for 90% confidence, we get a z alpha/2 value of 1.6449 (or 1.645).

    Then the sample size, n, we need is = 1.6449^2 * 4360 / ( 1.6449^2 + (4 * 4359) * 0.05^2) = 255 (rounded up)


    For N = 681,
    sample size n is = 1.6449^2 * 681 / ( 1.6449^2 + (4 * 680) * 0.05^2) = 194 (rounded up)


    sample size for NZ: 255 (90% confidence with 5% margin of error, including a finite population correction factor)
    sample size for US: 194

*/
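
The sample size arithmetic in the comment above can be re-checked with a few lines of JavaScript. This is only a sketch: the function name sampleSize is made up for illustration, and it assumes the worst-case proportion p = 0.5, which is where the factor 4 in the formula comes from.

```javascript
// Sample size for estimating a proportion, with finite population correction.
// Assumes worst-case p = 0.5, so z^2*p*(1-p) becomes z^2/4 (hence the factor 4).
// N: population size, z: the "z alpha/2" value, m: margin of error.
// sampleSize is a hypothetical helper name, not part of the original notes.
function sampleSize(N, z, m) {
    const zSquared = z * z;
    return Math.ceil((zSquared * N) / (zSquared + 4 * (N - 1) * m * m));
}

console.log(sampleSize(4360, 1.6449, 0.05)); // NZ pages with isMRI true: 255
console.log(sampleSize(681, 1.6449, 0.05));  // US pages with isMRI true: 194
```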


// To add column: "URLs of pages detected as inMRI"
"_id","siteCount containsMRI","numPagesInMRICount","numPagesContainingMRICount"
X"nz","176.0","4360","9641"
"nz", "166.0", "?", "?"
"us","29.0","681","953"
"au","2.0","8","86"
"de","2.0","4","11"
"dk","2.0","4","7"
"bg","1.0","2","2"
"cz","1.0","0","1"
"es","1.0","1","1"
"fr","1.0","1","1"
"ie","1.0","1","3"

Total sites containing MRI: 216
Total num sites detected as containing MRI: 868 [db.getCollection('Websites').find({numPagesContainingMRI: {$gt: 0}}).count()]
[of which 96 isMRI sites from NZ]
Total pages detected as being in MRI: 5062
Total pages detected as containing MRI sentences: 10706
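
As a cross-check, the three totals above (216, 5062, 10706) can be reproduced by summing the table columns. The sketch below hard-codes the rows from the table and uses the crossed-out "nz" row (176 sites), since the "166.0" row has no page counts yet; the names summaryRows and columnTotals are illustrative only.

```javascript
// Rows copied from the table above; the nz row is the crossed-out 176-site one,
// because the replacement "166.0" row still has "?" for its page counts.
const summaryRows = [
    { id: "nz", sites: 176, inMRI: 4360, containsMRI: 9641 },
    { id: "us", sites: 29, inMRI: 681, containsMRI: 953 },
    { id: "au", sites: 2, inMRI: 8, containsMRI: 86 },
    { id: "de", sites: 2, inMRI: 4, containsMRI: 11 },
    { id: "dk", sites: 2, inMRI: 4, containsMRI: 7 },
    { id: "bg", sites: 1, inMRI: 2, containsMRI: 2 },
    { id: "cz", sites: 1, inMRI: 0, containsMRI: 1 },
    { id: "es", sites: 1, inMRI: 1, containsMRI: 1 },
    { id: "fr", sites: 1, inMRI: 1, containsMRI: 1 },
    { id: "ie", sites: 1, inMRI: 1, containsMRI: 3 }
];

// Sum each numeric column over all rows (hypothetical helper).
function columnTotals(rows) {
    return rows.reduce(function (acc, row) {
        acc.sites += row.sites;
        acc.inMRI += row.inMRI;
        acc.containsMRI += row.containsMRI;
        return acc;
    }, { sites: 0, inMRI: 0, containsMRI: 0 });
}

console.log(columnTotals(summaryRows));
// sites: 216, inMRI: 5062, containsMRI: 10706
```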



NZ - sample 255 pages from:
/*
db.Websites.aggregate([
    {
        $match: {
            $and: [
                {numPagesContainingMRI: {$gt: 0}},
                {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]}
            ]
        }
    },
    { $unwind: "$geoLocationCountryCode" },
    {
        $group: {
            _id: "nz",
            count: { $sum: 1 },
            domain: { $addToSet: '$domain' },
            numPagesInMRICount: { $sum: '$numPagesInMRI' },
            numPagesContainingMRICount: { $sum: '$numPagesContainingMRI' }
        }
    },
    { $sort : { count : -1} }
]);


OR is this better (only numPagesInMRI):

db.Websites.aggregate([
    {
        $match: {
            $and: [
                {numPagesInMRI: {$gt: 0}},
                {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]}
            ]
        }
    },
    { $unwind: "$geoLocationCountryCode" },
    {
        $group: {
            _id: "nz",
            count: { $sum: 1 },
            domain: { $addToSet: '$domain' },
            numPagesInMRICount: { $sum: '$numPagesInMRI' },
            numPagesContainingMRICount: { $sum: '$numPagesContainingMRI' }
        }
    },
    { $sort : { count : -1} }
]);
*/

num NZ sites with > 0 isMRI pages = 96
Total numPagesInMRI in NZ sites = 4360
Total numPagesContainingMRI in NZ sites = 7968

Using the results you get a list of domains that matched. 171 nz domains, though it should be 176? -1

Copy each domain (up to 255 of them) and look at the first 1 or 2 pages at most that match isMRI:

1. db.getCollection('Webpages').find({URL:/pukekohe\.directorybusiness\.co\.nz/, isMRI: true}) - check it contains a positive number of pages in MRI and check the first 1-2 pages to make sure they are indeed in MRI. Note down the ratio of MRI finds. e.g. 2/2.

2. Find those pages that are containsMRI but not isMRI and check if there are indeed sentences in MRI. Note down the ratio for the first 2 pages.
db.getCollection('Webpages').find({URL:/maori\.livingheritage\.org\.nz/, isMRI: false, containsMRI: true})
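
The two steps above inspect the first 1-2 pages per domain. If a random pick per domain were wanted instead, one possible sketch is to build an aggregation pipeline ending in MongoDB's $sample stage. This is an alternative, not what was done for these notes; the helper names escapeRegex and buildSamplePipeline are hypothetical, and only the field names URL, isMRI and containsMRI come from the queries above.

```javascript
// Escape regex metacharacters so dots in a domain match literally.
function escapeRegex(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: match pages of one domain with the given flags,
// then let the $sample stage pick `size` of them at random.
function buildSamplePipeline(domain, flags, size) {
    const match = Object.assign({ URL: new RegExp(escapeRegex(domain)) }, flags);
    return [
        { $match: match },
        { $sample: { size: size } }
    ];
}

// Step 2 above, as a random sample of 2 pages instead of the first 2:
const pipeline = buildSamplePipeline(
    "maori.livingheritage.org.nz",
    { isMRI: false, containsMRI: true },
    2
);
console.log(JSON.stringify(pipeline[1]));
```

In the mongo shell the result would be passed to db.Webpages.aggregate(pipeline), assuming the collection is named as in the queries above.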


First column: n pages that are in MRI / n sampled isMRI pages
Second column: n pages that do contain MRI / n sampled pages that are not isMRI yet contain MRI

/* 1 */
{
    "_id" : "nz",
    "count" : 96.0,
    "domain" : [
        "http://www.teipukarea.maori.nz", 3/3 1/3
        "http://ngatipahauwera.co.nz", 2/2, 2/2
        "http://www.oag.govt.nz", 2/2 0/2
        "https://sexualviolence.victimsinfo.govt.nz", 3/3 0/3
        "http://tmoa.tki.org.nz", 3/3 3/3
        "http://www.tewhanake.maori.nz", 3/3 2/3
        "http://www.matarikifestival.org.nz", 4/4 0/3
        "http://www.otepoti.school.nz", 3/3 0/4
!!      "https://www.maoritelevision.com", 3/4, 0 [no containsMRI outside isMRI pages]
        "http://pukapuka.nz", 3/3 1/4 [lorem ipsum used on first 3 pages]
        "http://community.nzdl.org", 3/3 0/3 [containsMRI has detected Te Taka Keegan as MRI sentence]
X!!     "http://kmpmusic.co.nz", 0-4/4? [but CD listing of some MRI album and song titles] 0 [no other pages containsMRI]
        "http://maori.livingheritage.org.nz", 2/2 2/2 {includes: http://www.livingheritage.org.nz}
        "http://pukoro.co.nz", 2/2 0/2
X       "https://register.tpota.org.nz", 0/1 [form] 0/2
+       "https://cdn.tehiku.nz" => DOMAIN: "tehiku.nz", 0/4, 1/3 [but audio content may be in MRI] But there are pages containing MRI to be found by non-random sampling of tehiku.nz, e.g. https://tehiku.nz/te-hiku-radio/te-tangihanga-ki-a-erima-henare/ contains MRI sentences
!!      "http://www.runanga.co.nz", 3/3 0 [no containsMRI outside isMRI pages]
!       "http://kuraaiwi.maori.nz", 2/4 [navigation only downloaded. But site content checked] 2/3
        "http://kurataiao.tki.org.nz", 3/3, 1/total 3

!!      "http://satellites.co.nz", 3/3 [kpop], 0 [no containsMRI outside isMRI pages]
        "http://teaohou.natlib.govt.nz", 4/4, 2/4
        "http://www.tuwharetoa.iwi.nz", 2/3 0/3
+       "http://auturoa.nz", 0/4 0/3 [lots of MRI terms among English] - COMMUNITY (But there are pages inMRI to be found by non-random sampling, e.g. http://auturoa.nz/KarakiaMoKuaToRangiTeRaa.html)
        "https://www.terito.school.nz", 3/3, 0/2 total
        "https://ttw1.cwp.govt.nz", 3/3 3/3
        "https://www.whanau-tahi.school.nz", 4/4, 1/2 total
        "https://e-ako-pangarau.nzmaths.co.nz", 3/3 total, 1/1 total
        "https://teaomaori.news", 3/3, 0/1 total
        "http://tetaurawhiri.govt.nz", 3/3 3/3 [Māori Language Commission site]
        "https://www.tuiatematangi.ac.nz", 4/4 3/3
        "http://animations.tewhanake.maori.nz", 3/3 3/3
!!      "https://www.dnc.org.nz", 1/1 total, 0 [no containsMRI outside isMRI pages]
!!      "http://firstworldwar.tki.org.nz", 3/3, 0 [no containsMRI outside isMRI pages]
        "http://www.28maoribattalion.org.nz", 3/3, 1/3
        "http://www.tewikiotereomaori.co.nz", 1/1 total, 3/3
        "http://www.brettgraham.co.nz", 1/1 total, 0/3
!!      "https://hepatakakupu.nz", 3/3, 0 [no containsMRI outside isMRI pages]

        "http://anglicanprayerbook.nz", 3/3 3/3
        "http://arataua.nz", 4/4, 2/3
        "http://maori.tki.org.nz", 3/3 3/3
DONE (with/out www): "http://www.firstworldwar.tki.org.nz",
X       "http://www.topomap.co.nz", 0/2 [all placenames], 0 [no containsMRI outside isMRI pages]
        "https://paekupu.co.nz", 4/4, 0 [no containsMRI outside isMRI pages]
        "https://haereheikaiako.co.nz", 1/1, 0 [no containsMRI outside isMRI pages]
        "https://curriculumtool.education.govt.nz", 4/4, 3/3
        "http://kurakokiri.maori.nz", 3/3, 3/3 [same nav menus on each page] {includes: "http://www.kurakokiri.maori.nz"}
        "http://www.kkmmaungarongo.co.nz", 3/3, 3/3
        "http://www.heartland.co.nz", 3/3, 1/1 total
        "http://oilcrash.com", 2/2 total, 0/3
        "http://www.kura-porirua.school.nz", 4/4, 2/3
        "https://www.sporty.co.nz", 3/3, 0 [no containsMRI outside isMRI pages]
        "https://www.tematawai.maori.nz", 3/3, 3/3

+       "https://www.terakipaewhenua.school.nz",
+       "http://www.tetaurawhiri.govt.nz",
+       "http://archive.stats.govt.nz", (1 page isMRI)
+       "http://tiritiowaitangi.govt.nz",
+!!     "http://www.waiata.maori.nz", {includes: "http://waiata.maori.nz"}
+       "http://hana.co.nz", [crawled version of page contains MRI sentences, but current page is labelled for mobiles whereas in browser it just has a picture]
+       "http://kaupare.co.nz",
+       "http://www.tereowrap.nz",
?X      "https://www.e-agent.nz", [autotranslated? SEO related site in NZ, Chinese, English and MRI] {includes: "https://office.e-agent.nz"}
            { included: "http://videos.e-agent.nz", [AT: e-agent.nz] 3/3, 3/3 [repeated nav] }
+       "http://www.hrc.co.nz",
+       "http://ngatiporoukiponeke.org.nz",

+       "http://rurued.school.nz",
+       "http://www.twtop.school.nz",
X       "https://www.infinite-electronic.nz", [autotranslated product site]
+!!     "http://www.huri-translations.pf",
+       "https://admin.teara.govt.nz", {included: "http://blog.teara.govt.nz", 3/3, 0/3 [AS: teara.govt.nz]}
+!!     "https://tiritiowaitangi.govt.nz",
+       "http://www.tmoa.tki.org.nz",
+       "https://www.komako.org.nz", [no longer available, but crawled page had a paragraph in Maori with presumably a translation thereafter]
+       "http://www.wcl.govt.nz", {included: "http://kete.wcl.govt.nz" as wcl.govt.nz; 2/5 [first 3 misdetected: Tokelauan (American Samoa), Kiribati, Tongan], 0/3}
+!!     "http://punareo.co.nz", [waiata]

+       "https://rapuatearatika.education.govt.nz",
+       "http://tmmkkm.school.nz",
X       "https://www.components-mart.nz", [autotranslated product site]
+       "http://www.cs.waikato.ac.nz", [Te Taka's pages!]
+!!!    "http://www.kupengahao.co.nz", [MRI language books and resources]
+       "https://www.hapuhauora.health.nz", [Smokefree site. The one isMRI page has a proper sentence.]
X       "https://www.lcds-display.nz", [autotranslated product site]
+       "http://cms.sunsmartschools.co.nz", [as sunsmartschools.co.nz, e.g. http://sunsmartschools.co.nz/kura/how-to-become.html]
+       "http://kuraproductions.co.nz",
+       "https://keepourmoneyclean.govt.nz", [1 page]

+!!     "http://www.tekura.school.nz",
+       "http://www.tkkmmokopuna.school.nz", [e.g. http://www.tkkmmokopuna.school.nz/newsletter_sets/newsletters/13-wahanga-4-wiki-5-he-puna-korero]
+       "http://hangaraumatihiko.tki.org.nz", [govt website. e.g. http://hangaraumatihiko.tki.org.nz/te-tupuranga-tangata-me-te-rorohiko/whakatupuranga-3-te-whakatau-tikanga-kawe-i-nga-mahi-whakaoti-hopanga/]
+       "http://www.pakanae.maori.nz"
    ],
    "numPagesInMRICount" : 4360,
    "numPagesContainingMRICount" : 7968
}


96 sites detected as having isMRI pages - 7 sites/subdomains already included in the existing 96 = 89 sites.

-2.5* product sites -2 non-MRI sites with songlistings or forms etc
    *0.5 for e-agent.nz site
= 84.5 sites total that at least contain MRI, most have pages inMRI.
----------------------------

/* 1 */
{
    "_id" : "nz",
    "count" : 176.0,
    "domain" : [
!!      "http://pukekohe.directorybusiness.co.nz", 0/2, 0/2, isMRI = 0!!
        "http://maori.livingheritage.org.nz", 2/2 2/2
        "http://pukoro.co.nz", 2/2 0/2
        "http://www.rakaumanga.school.nz", 0/4 0/4
        "http://www.ngamanawainc.co.nz", 0/2 0/2
        "https://office.e-agent.nz",
        "https://www.components-mart.nz",
        "http://tmmkkm.school.nz",
        "http://www.rotoruanz.com",
        "http://www.huri-translations.pf",
        "https://admin.teara.govt.nz",
        "http://hangaraumatihiko.tki.org.nz",
        "https://sexualviolence.victimsinfo.govt.nz",
        "http://www.tekura.school.nz",
        "http://philipbeadle.co.nz",
        "http://www.cs.waikato.ac.nz",
        "https://www.hapuhauora.health.nz",
        "http://cms.sunsmartschools.co.nz",
        "https://keepourmoneyclean.govt.nz",
        "http://www.kura-porirua.school.nz",
        "http://waitarahistory.org.nz",
        "http://oilcrash.com",
        "http://videos.e-agent.nz",
        "https://manawatuheritage.pncc.govt.nz",
        "https://www.terakipaewhenua.school.nz",
        "http://dev.nzpcn.org.nz",
        "https://kotahimiriona.co.nz",
        "http://kurakokiri.maori.nz",
        "https://www.sporty.co.nz",
        "http://kaupare.co.nz",
        "http://ngatiporoukiponeke.org.nz",
        "https://www.takitimu.ac.nz",
        "http://www.tetaurawhiri.govt.nz",
        "http://www.waiata.maori.nz",
        "http://conference.tpwt.maori.nz",
        "http://ngatiwhakaue.iwi.nz",
        "http://www.nzpcn.org.nz",
        "http://www.ruralfind.co.nz",
        "https://www.dnc.org.nz",
        "https://www.puau.school.nz",
        "https://kaiiwicamp.nz",
        "https://www.terito.school.nz",
        "https://www.pinterest.nz",
        "https://e-ako-pangarau.nzmaths.co.nz",
        "http://givealittle.co.nz",
        "https://teaomaori.news",
        "https://www.korokikahukura.co.nz",
        "http://myfathersworld.net.nz",
        "http://www.firstworldwar.tki.org.nz",
        "https://www.ashtangatauranga.co.nz",
        "http://biketorqueyamaha.co.nz",
        "https://www.rereahu.maori.nz",
        "http://www.tewikiotereomaori.co.nz",
        "http://www.brettgraham.co.nz",
        "http://tewikiotereomaori.nz",
        "http://anglicanprayerbook.nz",
        "http://arataua.nz",
        "http://blog.teara.govt.nz",
        "http://www.otepoti.school.nz",
        "http://www.kmk.maori.nz",
        "http://www.eventcinemas.co.nz",
        "https://www.stats.govt.nz",
        "http://www.oag.govt.nz", 2/2 0/2
        "http://whatonga.school.nz",
        "http://www.tewhanake.maori.nz",
        "https://www.maoritelevision.com",
        "http://kuraaiwi.maori.nz",
        "http://kurataiao.tki.org.nz",
        "http://teaohou.natlib.govt.nz",
        "http://www.tetaumuturunanga.iwi.nz",
        "http://www.tasteofplenty.co.nz",
        "http://community.nzdl.org",
        "https://www.blushandbrows.nz",
        "https://register.tpota.org.nz",
        "https://cdn.tehiku.nz",
        "http://www.wcl.govt.nz",
        "http://www.jeremybaker.nz",
        "http://punareo.co.nz",
        "https://rapuatearatika.education.govt.nz",
        "http://www.kurakokiri.maori.nz",
        "https://www.cruisetourstauranga.co.nz",
        "https://sooty.nz",
        "http://rakaumanga.school.nz",
        "https://tiritiowaitangi.govt.nz",
        "http://www.tmoa.tki.org.nz",
        "http://www.w3vietnam.org.nz",
        "https://www.infinite-electronic.nz",
        "https://www.komako.org.nz",
        "http://nzpostcard.co.nz",
        "http://artizani.co.nz",
        "http://www.finlaysonpark.school.nz",
        "http://crimson.co.nz",
        "http://holyspirit.nz",
        "http://www.tkkmmokopuna.school.nz",
        "http://www.pakanae.maori.nz",
        "http://www.teipukarea.maori.nz",
        "http://archerpix.com",
        "https://2019.nethui.nz",
        "http://www.kupengahao.co.nz",
        "https://www.lcds-display.nz",
        "http://waiata.maori.nz",
        "http://kuraproductions.co.nz",
        "http://www.biketorqueyamaha.co.nz",
        "http://www.livingheritage.org.nz",
        "http://www.zoomin.co.nz",
        "http://rsnz.natlib.govt.nz",
        "http://otorohanga.directorybusiness.co.nz",
        "http://reoora.co.nz",
        "http://w3vietnam.org.nz",
        "https://rehuamarae.co.nz",
        "https://www.electionresults.org.nz",
        "https://www.ngamanawainc.co.nz",
        "https://www.rotorua-rafting.co.nz",
        "https://www.taitokerautrust.org.nz",
        "https://www.wingspan.co.nz",
        "http://www.kkmmaungarongo.co.nz",
        "http://kete.wcl.govt.nz",
        "http://www.heartland.co.nz",
        "http://www.electionresults.govt.nz",
        "https://www.tematawai.maori.nz",
        "http://hana.co.nz",
        "http://www.tereowrap.nz",
        "http://rurued.school.nz",
        "http://www.twtop.school.nz",
        "http://rexedra.gen.nz",
        "http://archive.stats.govt.nz",
        "https://liveresults.co.nz",
        "https://www.e-agent.nz",
        "http://tiritiowaitangi.govt.nz",
        "http://www.hrc.co.nz",
        "http://animations.tewhanake.maori.nz",
        "https://interactives.stuff.co.nz",
        "http://avonside.net",
        "http://www.methodist.org.nz",
        "https://www.tasteofplenty.co.nz",
        "http://www.maoriinvestments.co.nz",
        "https://m.wairarapatv.co.nz",
        "http://www.gans.co.nz",
        "https://ttw1.cwp.govt.nz",
        "http://ngarauhuia.ngatiapakiterato.iwi.nz",
        "https://www.tuiatematangi.ac.nz",
        "http://tetaurawhiri.govt.nz",
        "http://maori.tki.org.nz",
        "http://www.topomap.co.nz",
        "https://www.puhaandpakeha.co.nz",
        "https://haereheikaiako.co.nz",
        "https://paekupu.co.nz",
        "https://curriculumtool.education.govt.nz",
        "http://firstworldwar.tki.org.nz",
        "http://www.28maoribattalion.org.nz",
        "https://hepatakakupu.nz",
        "https://www.zenbu.co.nz",
        "http://www.matarikifestival.org.nz",
        "http://pukapuka.nz",
        "http://ngatipahauwera.co.nz", 2/2 2/2
        "http://southerntribes.co.nz",
        "https://player.vimeo.com",
        "http://tmoa.tki.org.nz",
        "http://www.writersfestival.co.nz",
        "http://talkingtothecan.com",
        "https://www.whanau-tahi.school.nz",
        "http://satellites.co.nz",
        "http://auturoa.nz",
        "http://www.tuwharetoa.iwi.nz",
        "http://kmpmusic.co.nz",
        "http://www.temarareo.org",
        "http://archive.electionresults.govt.nz",
        "http://kaiiwicamp.nz",
        "http://tehauora.org.nz",
        "http://temahurehure.maori.nz",
        "http://www.runanga.co.nz"
    ],
    "numPagesInMRICount" : 4360,
    "numPagesContainingMRICount" : 9641
}


----------------------------

The remainder: 80 NZ sites detected as not containing pages inMRI, but with a positive number of pages detected as containsMRI:

db.Websites.aggregate([
    {
        $match: {
            $and: [
                {numPagesContainingMRI: {$gt: 0}},
                {numPagesInMRI: {$eq: 0}},
                {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]}
            ]
        }
    },
    { $unwind: "$geoLocationCountryCode" },
    {
        $group: {
            _id: "nz",
            count: { $sum: 1 },
            domain: { $addToSet: '$domain' },
            numPagesInMRICount: { $sum: '$numPagesInMRI' },
            numPagesContainingMRICount: { $sum: '$numPagesContainingMRI' }
        }
    },
    { $sort : { count : -1} }
]);


Find pages for testing with:
    db.getCollection('Webpages').find({URL:/ashtangatauranga\.co\.nz/, containsMRI: true, mriSentenceCount: {$gt: 0}})


/* 1 */
{
    "_id" : "nz",
    "count" : 80.0,
    "domain" : [
X       "http://www.zoomin.co.nz", [map site, so placenames]
X       "http://www.biketorqueyamaha.co.nz", [placenames] {includes "http://biketorqueyamaha.co.nz"}
X       "http://archerpix.com", [photo captions containing placenames]
X       "http://philipbeadle.co.nz", [art captions containing placenames]
X       "https://2019.nethui.nz", [Just MRI words in ENG sentences]
X       "http://crimson.co.nz", [address]
+       "http://holyspirit.nz", (e.g. https://holyspirit.nz/wp-content/uploads/02-25-2018-Newsletter.pdf)
X       "https://www.wingspan.co.nz", [1 page, 1 conjoined word, looks like placename]
X       "http://nzpostcard.co.nz", [postcards with placenames]
+       "https://www.ngamanawainc.co.nz", (e.g. https://www.ngamanawainc.co.nz/history/id/60) {includes "http://www.ngamanawainc.co.nz"}

+       "http://www.finlaysonpark.school.nz", [e.g. http://www.finlaysonpark.school.nz/58/pages/61-team-overview; Some Tongan language pages and Samoan pages]
X       "http://artizani.co.nz", [address]
+       "http://www.w3vietnam.org.nz", [e.g. http://www.w3vietnam.org.nz/w3%20mihi%20ki%20ka%20hoia.htm] (includes "http://w3vietnam.org.nz")
X       "https://sooty.nz", [names, war death notices, place names]
X?      "http://rakaumanga.school.nz", [School has a longer Maori title name, but that's it] {includes "http://www.rakaumanga.school.nz"}
X       "http://www.rotoruanz.com", [e.g. https://www.rotoruanz.com/RNZ/media/Media-Library/SOI-Board-version-08-June-2018-final-published-on-website.pdf]
X       "https://www.cruisetourstauranga.co.nz", [English with Tauranga being MRI placename]
X       "http://www.jeremybaker.nz", [one word, Hokio]

X       "https://liveresults.co.nz", [canoe sports team names]
X       "http://rexedra.gen.nz", [ENG sentence with MRI words]
+       "https://www.takitimu.ac.nz", [school of performing arts, example MRI sentence at https://www.takitimu.ac.nz/about-us]
X       "http://www.electionresults.govt.nz", [placenames and misdetection of Return to Home Page] {includes https://www.electionresults.org.nz which seems to be an alias for the same} {includes "http://archive.electionresults.govt.nz"}
+       "https://kotahimiriona.co.nz", (e.g. https://kotahimiriona.co.nz/about-the-app/)
+       "https://rehuamarae.co.nz", (e.g. https://rehuamarae.co.nz/)
+       "http://reoora.co.nz", (e.g. https://reoora.co.nz/about/)

X       "http://otorohanga.directorybusiness.co.nz", [placenames]
X       "http://waitarahistory.org.nz", [placenames and misdetection of sentences of the form "31, No 262." as MRI]
+       "https://manawatuheritage.pncc.govt.nz", (e.g. https://manawatuheritage.pncc.govt.nz/about)
+       "http://rsnz.natlib.govt.nz", [e.g. http://rsnz.natlib.govt.nz/volume/rsnz_26/rsnz_26_00_004560.html] NOTE: site appears to use Greenstone
X       "https://www.rotorua-rafting.co.nz", [placenames]
+       "https://www.taitokerautrust.org.nz", (e.g. https://www.taitokerautrust.org.nz/)
+       "http://tewikiotereomaori.nz", (e.g. https://tewikiotereomaori.nz/about/)
+       "https://www.korokikahukura.co.nz", (e.g. https://www.korokikahukura.co.nz/ko-wai-m257tou.html about the connection to the Waikato River)

X       "https://www.puhaandpakeha.co.nz", [ENG sentences with MRI words]
X       "http://myfathersworld.net.nz", [placenames]
X       "https://www.ashtangatauranga.co.nz", [misdetection]
+       "https://www.pinterest.nz", (e.g. sentence "E tu Whare Rangi, e tu Whare Pataka, kia ora koutou nga Whare Tipuna o te Ao Maori." at https://www.pinterest.nz/amp/marylizw03/maori/)
+       "https://www.rereahu.maori.nz", (e.g. https://www.rereahu.maori.nz/uploads/72717/files/168860/Te_Maru_o_Rereahu_Newsletter_No_5_February_2011.pdf)
+       "http://givealittle.co.nz", (e.g. https://givealittle.co.nz/cause/rebuild-tapu-te-ranga-marae which contains greetings phrases and sentence "Nā te ringa tangata i hanga te whare Nā te tuarā o te whare i whakatipu i te tangata")
X       "http://www.gans.co.nz", [placenames]
+       "https://kaiiwicamp.nz", [placenames] {includes "http://kaiiwicamp.nz"}
+       "http://ngarauhuia.ngatiapakiterato.iwi.nz", (e.g. http://ngarauhuia.ngatiapakiterato.iwi.nz/pdf/Nga-Rau-Huia_Raumati_2015_Issue1.pdf)
+       "https://m.wairarapatv.co.nz", (e.g. https://m.wairarapatv.co.nz/archive/i/JGmJqCZCNK8/te-ara-whanui-kura-kaupapa-maori-o-nga-khanga-reo-o-te-awa-kairangi)

X       "http://www.methodist.org.nz", [ENG sentence with MRI words]
+       "http://avonside.net", (e.g. http://avonside.net/Nicky/indexmain.htm)
X       "http://www.ruralfind.co.nz", [placenames]
+       "http://www.maoriinvestments.co.nz", (e.g. Vision, Mission & Values tab of https://maoriinvestments.co.nz/about/organisation)
+       "http://conference.tpwt.maori.nz", (e.g. https://www.tpwt.maori.nz/te-mahe-matauranga/)
+       "https://www.puau.school.nz", (e.g. https://www.puau.school.nz/k%C4%81inga-home)
+?      "http://ngatiwhakaue.iwi.nz", (e.g. greetings at http://ngatiwhakaue.iwi.nz/)
X       "http://www.nzpcn.org.nz", [lots of misdetections and one sentence in English with MRI words, e.g. "Kaori is a Taonga to Maori".] {includes "http://dev.nzpcn.org.nz"}
+?      "https://interactives.stuff.co.nz", [2 x page mostly empty except for the title "TE WIKI O TE REO MĀORI Māori"]
+       "http://tehauora.org.nz", (e.g. http://tehauora.org.nz/ and proverb at bottom of http://tehauora.org.nz/about-us)

+       "http://temahurehure.maori.nz", (e.g. greeting http://temahurehure.maori.nz/site/file/temp/hirebook.pdf)
X       "http://pukekohe.directorybusiness.co.nz", [placenames]
+!!     "http://www.temarareo.org", (dictionary with sample sentences. e.g. http://www.temarareo.org/PAPAKUPU/dictionary-searchresults/A-2.htm)
X       "https://www.tasteofplenty.co.nz", [Tauranga placename and misdetection of "No one went away hungry"] {includes "http://www.tasteofplenty.co.nz"}

+       "http://www.tetaumuturunanga.iwi.nz", (e.g. http://www.tetaumuturunanga.iwi.nz/wp-content/uploads/2016/04/Malvern-Cultural-Narrative-Draft.pdf)


X       "https://www.blushandbrows.nz", [misdetection of "Makeup..."]
X       "http://talkingtothecan.com", [misdetection of things like "22, no." and mistaking ENG sentences with MRI words]

+?      "http://whatonga.school.nz", [school title]
+?      "https://player.vimeo.com", [Video titles contain MRI sentences and MRI titles interspersed in ENG sentence. Video's audio content may be in MRI]
+       "http://www.writersfestival.co.nz", (e.g. http://www.writersfestival.co.nz/news/Page1/urgently-relevant-novel-wins-countrys-richest-literary-award/)
+?      "http://southerntribes.co.nz", [brazilian jiu jitsu site with a greeting in MRI on main page]
+       "http://www.kmk.maori.nz", (e.g. http://www.kmk.maori.nz/kmk-events)
+       "https://www.stats.govt.nz", (e.g. http://archive.stats.govt.nz/Census/2001-census-data/2001-census-pacific-profiles/cook-island-maori-people-in-new-zealand.aspx)
X       "http://www.eventcinemas.co.nz", [placenames and misdetection of ENG "Atura Hotels"]
X       "https://www.zenbu.co.nz" [misdetection and NZ school addresses]
    ],
    "numPagesInMRICount" : 0,
    "numPagesContainingMRICount" : 1673
}