source: other-projects/maori-lang-detection/mongodb-data/6table_nonProductSites1_manualShortlist.json@ 33890

Last change on this file since 33890 was 33890, checked in by ak19, 4 years ago

Finished going through NZ sites listing of numPagesContainingMRI > 0 and manually determining which of these sites really contained at least one webpage containing at least one sentence inMRI.

File size: 29.5 KB
/*

db.Websites.aggregate([
    {
        $match: {
            $and: [
                {numPagesInMRI: {$gt: 0}},
                {$or: [{geoLocationCountryCode: "NZ"}, {domain: /\.nz/}]}
            ]
        }
    },
    { $unwind: "$geoLocationCountryCode" },
    {
        $group: {
            _id: "nz",
            count: { $sum: 1 },
            domain: { $addToSet: '$domain' },
            numPagesInMRICount: { $sum: '$numPagesInMRI' },
            numPagesContainingMRICount: { $sum: '$numPagesContainingMRI' }
        }
    },
    { $sort : { count : -1} }
]);

For sites originating in NZ or with nz TLD, none of the URLs are manually inspected and all URLs are accepted.

For all but NZ, get final column results with:
    db.getCollection('Websites').find({domain:/coggle\.it/})
And can check for URLs with:
    db.getCollection('Webpages').find({URL: /coggle\.it/, isMRI: true})


NOTES:
1. DE:

"de","2.0","0+1","9+35 misdetected", http://www.cartogiraffe.com, https://www.cartogiraffe.com,
Ought to be 2+2 numPagesInMRICount and 9+2 numPagesContainingMRICount:
- both cartogiraffe.com pages were identical and had mostly MRI sentences with one name not being MRI. So isMRI should have been true for both pages.
- Only one of the 2 MRI translations of the Universal Declaration of Human Rights at http://www.udhr.de got downloaded. A total of 75 pages were downloaded, but more translated pages appeared to be on the website. Not sure why the crawl had a _SUCCESS file to indicate completed download.
- Then http://www.udhr.de had 35-1 non-MRI language translations of the Universal Declaration of Human Rights where one or more sentences were misdetected as MRI. With the additional MRI page that didn't get downloaded, should have 9+2 = 11 pages containing MRI.

So instead of
"de","2.0","1","44", http://www.cartogiraffe.com, https://www.cartogiraffe.com, http://www.udhr.de
it should be
"de","2.0","4","11", http://www.cartogiraffe.com, https://www.cartogiraffe.com, http://www.udhr.de


"au","3.0",7+0+1,83+1+3,https://www.kiwiproperty.com, https://infogram.com/te-marautanga-o-aotearoa-moe-pld-allocations-2012-1go502ygvn562jd,https://koreromaori.com

2. US:
aclhokiangarocks.blogspot.com contains at least one page with MRI paragraphs. See http://aclhokiangarocks.blogspot.com/feeds/posts/default under section "Nga Tuhinga o tatou Tupuna".
Although this page has been crawled by Nutch, the blog presents its contents in a complex way, so the text wasn't retrieved here. See also the dedicated page this text should have been in: http://aclhokiangarocks.blogspot.com/2012/05/nga-tuhinga-o-tatou-tupuna.html

"_id","siteCount","numPagesInMRICount","numPagesContainingMRICount","URLs of pages detected as inMRI"
"nz","176.0" (sites with containsMRI pages, of which 96 have isMRI pages),"4360","9641" (across the 176 containsMRI sites; 7968 across the 96 isMRI sites)
"us","29.0",
    1+2+0+0+4+166+0+39 +257+2+21+12+25+13+53+0+1+0+1+11 +32+37+4 +0+0+0 = 681,
    31+2+2+20+58+166+3+91 +258+2+25+12+66+22+53+6+1+1+2+10 +58+54+6 +1+2+1 = 953,
    anglicanhistory.org,unicode.org,static-promote.weebly.com,aclhokiangarocks.blogspot.com,bahaiprayers.net,biblehub.com,muhammad.com,godrules.net,m.biblepub.com, krassotkin.ru,gotquestions.org,
    maorinews.com,maaori.com,kiaorahola.blogspot.com,kjohnsonnz.blogspot.com,pumanawawhangara.blogspot.com,dannykahei.tripod.com,burkekm001.tripod.com,tkkpipipaopao.blogspot.com, manateina.blogspot.com,
    tatai09.blogspot.com,twttoa.com,tuhua2010.blogspot.com,
    breaker.audio,drive.google.com/file/d/1NwuzafjddaP8gxI7O_Zapts5bM7mrtwn/preview,in.pinterest.com/pin/317363104978423418/
"au","2.0","8","86", https://www.kiwiproperty.com, https://koreromaori.com
"de","2.0","4","11", http://www.cartogiraffe.com, https://www.cartogiraffe.com, http://www.udhr.de
"dk","2.0","4","7", *.ngapuhitelevision.com, *.ngapuhiradio.com
"bg","1.0","2","2", http://anitra.net/activism/humanrights/UDHR/mbf_print.htm, http://anitra.net/activism/humanrights/UDHR/rrt_print.htm
"cz","1.0","0","1", http://www.henryklahola.nazory.cz/094.Maori.htm, http://henryklahola.nazory.cz/094.Maori.htm
"es","1.0","1","1", https://www.uv.es/~pla/red.net/intmaori.html
"fr","1.0","1","1", http://chantsdeluttes.free.fr/versionsinter/page%20maori.html
"ie","1.0","1","3", https://coggle.it/diagram/WSYB0mLA2QABD5BH/t/ko-au-ko-koe


--------------

    https://www.statisticshowto.datasciencecentral.com/probability-and-statistics/find-sample-size/#CI1
    https://stats.stackexchange.com/questions/207584/sample-size-choice-with-binary-outcome
    https://www.statisticshowto.datasciencecentral.com/z-alpha2-za2/

    N (NZ pages where isMRI comes out true) = 4360
    solving for n, the sample size
    confidence level = 90%
    m, margin of error = 5%

    From the "z alpha/2" table, for 90% confidence, we get a z alpha/2 value of 1.6449 (or 1.645).

    Then the sample size, n, we need is = 1.6449^2 * 4360 / ( 1.6449^2 + (4 * 4359) * 0.05^2) = 255 (rounded up)
    (the factor 4 comes from assuming the worst-case proportion p = 0.5)


    For N = 681,
    sample size n is = 1.6449^2 * 681 / ( 1.6449^2 + (4 * 680) * 0.05^2) = 194 (rounded up)


    sample size for NZ: 255 (90% confidence with 5% margin of error, including a finite population correction factor)
    sample size for US: 194

*/


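The two sample sizes in the comment block above can be re-derived with a few lines of JavaScript. This is only a sketch of the finite-population formula quoted in those notes: the names `sampleSize` and `Z_90` are made up for illustration, and the worst-case proportion p = 0.5 is an assumption (it is what produces the factor 4).

```javascript
// Sample size for estimating a proportion in a finite population of N pages.
// Assumes the worst-case proportion p = 0.5, which gives the factor 4 below.
function sampleSize(populationSize, z, marginOfError) {
  const zSquared = z * z;
  const n = (zSquared * populationSize) /
            (zSquared + 4 * (populationSize - 1) * marginOfError * marginOfError);
  return Math.ceil(n); // round up to a whole number of pages
}

const Z_90 = 1.6449; // z alpha/2 for 90% confidence, as quoted in the notes

console.log(sampleSize(4360, Z_90, 0.05)); // NZ isMRI pages -> 255
console.log(sampleSize(681, Z_90, 0.05));  // US isMRI pages -> 194
```

Both results agree with the figures worked out by hand in the notes.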
// To add column: "URLs of pages detected as inMRI"
"_id","siteCount containsMRI","numPagesInMRICount","numPagesContainingMRICount"
"nz","176.0","4360","9641"
"us","29.0","681","953"
"au","2.0","8","86"
"de","2.0","4","11"
"dk","2.0","4","7"
"bg","1.0","2","2"
"cz","1.0","0","1"
"es","1.0","1","1"
"fr","1.0","1","1"
"ie","1.0","1","3"

Total sites containing MRI: 216
[of which 96 isMRI sites from NZ]
Total pages detected as being in MRI: 5062
Total pages detected as containing MRI sentences: 10706



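As a cross-check, the three totals above can be recomputed from the per-country rows. The snippet below is a minimal sketch; the array `countryRows` and the helper `sumColumns` are hypothetical names, and the numbers are copied by hand from the table rather than read from the database.

```javascript
// Rows copied from the per-country table above:
// [country, siteCount, numPagesInMRICount, numPagesContainingMRICount]
const countryRows = [
  ["nz", 176, 4360, 9641],
  ["us", 29, 681, 953],
  ["au", 2, 8, 86],
  ["de", 2, 4, 11],
  ["dk", 2, 4, 7],
  ["bg", 1, 2, 2],
  ["cz", 1, 0, 1],
  ["es", 1, 1, 1],
  ["fr", 1, 1, 1],
  ["ie", 1, 1, 3],
];

// Add up each numeric column, ignoring the country code in position 0.
function sumColumns(rows) {
  return rows.reduce(
    (totals, [, sites, inMRI, containingMRI]) => ({
      sites: totals.sites + sites,
      inMRI: totals.inMRI + inMRI,
      containingMRI: totals.containingMRI + containingMRI,
    }),
    { sites: 0, inMRI: 0, containingMRI: 0 }
  );
}

const totals = sumColumns(countryRows);
console.log(totals); // { sites: 216, inMRI: 5062, containingMRI: 10706 }
```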
NZ - sample 255 pages from:
/*
db.Websites.aggregate([
    {
        $match: {
            $and: [
                {numPagesContainingMRI: {$gt: 0}},
                {$or: [{geoLocationCountryCode: "NZ"}, {domain: /\.nz/}]}
            ]
        }
    },
    { $unwind: "$geoLocationCountryCode" },
    {
        $group: {
            _id: "nz",
            count: { $sum: 1 },
            domain: { $addToSet: '$domain' },
            numPagesInMRICount: { $sum: '$numPagesInMRI' },
            numPagesContainingMRICount: { $sum: '$numPagesContainingMRI' }
        }
    },
    { $sort : { count : -1} }
]);


Or is this better (only numPagesInMRI)?

db.Websites.aggregate([
    {
        $match: {
            $and: [
                {numPagesInMRI: {$gt: 0}},
                {$or: [{geoLocationCountryCode: "NZ"}, {domain: /\.nz/}]}
            ]
        }
    },
    { $unwind: "$geoLocationCountryCode" },
    {
        $group: {
            _id: "nz",
            count: { $sum: 1 },
            domain: { $addToSet: '$domain' },
            numPagesInMRICount: { $sum: '$numPagesInMRI' },
            numPagesContainingMRICount: { $sum: '$numPagesContainingMRI' }
        }
    },
    { $sort : { count : -1} }
]);
*/

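One detail of the `$or` clause in the queries above: the pattern `/\.nz/` is unanchored, so it matches ".nz" anywhere in the domain string, not only as the TLD. The sketch below illustrates this on a few domains taken from these notes. The anchored pattern is a hypothetical alternative, not something the original queries use, and it assumes the `domain` field is stored as shown in the listings (protocol plus host, no trailing slash or port).

```javascript
const unanchored = /\.nz/;  // pattern used in the queries above
const anchored = /\.nz$/;   // hypothetical stricter variant: ".nz" must end the string

const domains = [
  "http://www.oag.govt.nz",
  "http://community.nzdl.org",
  "https://e-ako-pangarau.nzmaths.co.nz",
  "http://oilcrash.com",
];

// Report, per domain, whether each pattern matches.
const matches = domains.map((domain) => ({
  domain,
  unanchored: unanchored.test(domain),
  anchored: anchored.test(domain),
}));

console.log(matches);
```

Here only `http://community.nzdl.org` is matched by the unanchored pattern but not by the anchored one. A site like that may of course still qualify through `geoLocationCountryCode: "NZ"`, as `http://oilcrash.com` presumably does, so tightening the pattern would not necessarily change the result set.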
num NZ sites with > 0 isMRI pages = 96
Total numPagesInMRI in NZ sites = 4360
Total numPagesContainingMRI in NZ sites = 7968

The results give a list of the domains that matched: 171 nz domains, though it should be 176? -1

Copy each domain (up to 255 of them) and look at the first 1 or 2 pages (max) that match isMRI:

1. db.getCollection('Webpages').find({URL:/pukekohe\.directorybusiness\.co\.nz/, isMRI: true}) - check it returns a positive number of pages in MRI and check the first 1-2 pages to make sure they are indeed in MRI. Note down the ratio of MRI finds, e.g. 2/2.

2. Find those pages that are containsMRI but not isMRI and check whether they do contain sentences in MRI. Note down the ratio for the first 2 pages.
db.getCollection('Webpages').find({URL:/maori\.livingheritage\.org\.nz/, isMRI: false, containsMRI: true})


First column: n pages that are in MRI / n sampled isMRI pages
Second column: n pages that do contain MRI / n sampled pages that are not isMRI yet contain MRI

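The two ratio columns described above could be tallied mechanically. The following is a rough sketch only: `parseRatios` and `tallyColumn` are hypothetical helpers, they understand nothing but plain "hits/total" fractions, and free-form annotations such as "0-4/4?" or "0 [no containsMRI ...]" would still need to be handled by hand. The sample annotations are the first three entries of the 96-site listing that follows.

```javascript
// Extract plain "hits/total" fractions from an annotation such as "3/3 1/3".
function parseRatios(annotation) {
  const ratios = [];
  const pattern = /(\d+)\s*\/\s*(\d+)/g;
  let match;
  while ((match = pattern.exec(annotation)) !== null) {
    ratios.push({ hits: Number(match[1]), total: Number(match[2]) });
  }
  return ratios;
}

// Sum one ratio column (0 = isMRI sample, 1 = containsMRI-only sample).
function tallyColumn(annotations, columnIndex) {
  let hits = 0;
  let total = 0;
  for (const annotation of annotations) {
    const ratio = parseRatios(annotation)[columnIndex];
    if (ratio) {
      hits += ratio.hits;
      total += ratio.total;
    }
  }
  return { hits, total };
}

const sampleAnnotations = ["3/3 1/3", "2/2, 2/2", "2/2 0/2"];
console.log(tallyColumn(sampleAnnotations, 0)); // { hits: 7, total: 7 }
console.log(tallyColumn(sampleAnnotations, 1)); // { hits: 3, total: 7 }
```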
/* 1 */
{
    "_id" : "nz",
    "count" : 96.0,
    "domain" : [
     "http://www.teipukarea.maori.nz", 3/3 1/3
     "http://ngatipahauwera.co.nz", 2/2, 2/2
     "http://www.oag.govt.nz", 2/2 0/2
     "https://sexualviolence.victimsinfo.govt.nz", 3/3 0/3
     "http://tmoa.tki.org.nz", 3/3 3/3
     "http://www.tewhanake.maori.nz", 3/3 2/3
     "http://www.matarikifestival.org.nz", 4/4 0/3
     "http://www.otepoti.school.nz", 3/3 0/4
!!   "https://www.maoritelevision.com", 3/4, 0 [no containsMRI outside isMRI pages]
     "http://pukapuka.nz", 3/3 1/4 [lorem ipsum used on first 3 pages]
     "http://community.nzdl.org", 3/3 0/3 [containsMRI has detected Te Taka Keegan as MRI sentence]
X!!  "http://kmpmusic.co.nz", 0-4/4? [but CD listing of some MRI album and song titles] 0 [no other pages containsMRI]
     "http://maori.livingheritage.org.nz", 2/2 2/2 {includes: http://www.livingheritage.org.nz}
     "http://pukoro.co.nz", 2/2 0/2
X    "https://register.tpota.org.nz", 0/1 [form] 0/2
+    "https://cdn.tehiku.nz" => DOMAIN: "tehiku.nz", 0/4, 1/3 [but audio content may be in MRI] But there are pages containing MRI to be found by non-random sampling of tehiku.nz, e.g. https://tehiku.nz/te-hiku-radio/te-tangihanga-ki-a-erima-henare/ contains MRI sentences
!!   "http://www.runanga.co.nz", 3/3 0 [no containsMRI outside isMRI pages]
!    "http://kuraaiwi.maori.nz", 2/4 [navigation only downloaded. But site content checked] 2/3
     "http://kurataiao.tki.org.nz", 3/3, 1/3 total

!!   "http://satellites.co.nz", 3/3 [kpop], 0 [no containsMRI outside isMRI pages]
     "http://teaohou.natlib.govt.nz", 4/4, 2/4
     "http://www.tuwharetoa.iwi.nz", 2/3 0/3
+    "http://auturoa.nz", 0/4 0/3 [lots of MRI terms among English] - COMMUNITY (But there are pages inMRI to be found by non-random sampling, e.g. http://auturoa.nz/KarakiaMoKuaToRangiTeRaa.html)
     "https://www.terito.school.nz", 3/3, 0/2 total
     "https://ttw1.cwp.govt.nz", 3/3 3/3
     "https://www.whanau-tahi.school.nz", 4/4, 1/2 total
     "https://e-ako-pangarau.nzmaths.co.nz", 3/3 total, 1/1 total
     "https://teaomaori.news", 3/3, 0/1 total
     "http://tetaurawhiri.govt.nz", 3/3 3/3 [Māori Language Commission site]
     "https://www.tuiatematangi.ac.nz", 4/4 3/3
     "http://animations.tewhanake.maori.nz", 3/3 3/3
!!   "https://www.dnc.org.nz", 1/1 total, 0 [no containsMRI outside isMRI pages]
!!   "http://firstworldwar.tki.org.nz", 3/3, 0 [no containsMRI outside isMRI pages]
     "http://www.28maoribattalion.org.nz", 3/3, 1/3
     "http://www.tewikiotereomaori.co.nz", 1/1 total, 3/3
     "http://www.brettgraham.co.nz", 1/1 total, 0/3
!!   "https://hepatakakupu.nz", 3/3, 0 [no containsMRI outside isMRI pages]

     "http://anglicanprayerbook.nz", 3/3 3/3
     "http://arataua.nz", 4/4, 2/3
     "http://maori.tki.org.nz", 3/3 3/3
DONE (with/out www): "http://www.firstworldwar.tki.org.nz",
X    "http://www.topomap.co.nz", 0/2 [all placenames], 0 [no containsMRI outside isMRI pages]
     "https://paekupu.co.nz", 4/4, 0 [no containsMRI outside isMRI pages]
     "https://haereheikaiako.co.nz", 1/1, 0 [no containsMRI outside isMRI pages]
     "https://curriculumtool.education.govt.nz", 4/4, 3/3
     "http://kurakokiri.maori.nz", 3/3, 3/3 [same nav menus on each page] {includes: "http://www.kurakokiri.maori.nz"}
     "http://www.kkmmaungarongo.co.nz", 3/3, 3/3
     "http://www.heartland.co.nz", 3/3, 1/1 total
     "http://oilcrash.com", 2/2 total, 0/3
     "http://www.kura-porirua.school.nz", 4/4, 2/3
     "https://www.sporty.co.nz", 3/3, 0 [no containsMRI outside isMRI pages]
     "https://www.tematawai.maori.nz", 3/3, 3/3

+    "https://www.terakipaewhenua.school.nz",
+    "http://www.tetaurawhiri.govt.nz",
+    "http://archive.stats.govt.nz", (1 page isMRI)
+    "http://tiritiowaitangi.govt.nz",
+!!  "http://www.waiata.maori.nz", {includes: "http://waiata.maori.nz"}
+    "http://hana.co.nz", [crawled version of page contains MRI sentences, but current page is labelled for mobiles whereas in browser it just has a picture]
+    "http://kaupare.co.nz",
+    "http://www.tereowrap.nz",
?X   "https://www.e-agent.nz", [autotranslated? SEO related site in NZ, Chinese, English and MRI] {includes: "https://office.e-agent.nz"}
         { included: "http://videos.e-agent.nz", [AT: e-agent.nz] 3/3, 3/3 [repeated nav] }
+    "http://www.hrc.co.nz",
+    "http://ngatiporoukiponeke.org.nz",

+    "http://rurued.school.nz",
+    "http://www.twtop.school.nz",
X    "https://www.infinite-electronic.nz", [autotranslated product site]
+!!  "http://www.huri-translations.pf",
+    "https://admin.teara.govt.nz", {included: "http://blog.teara.govt.nz", 3/3, 0/3 [AS: teara.govt.nz]}
+!!  "https://tiritiowaitangi.govt.nz",
+    "http://www.tmoa.tki.org.nz",
+    "https://www.komako.org.nz", [no longer available, but crawled page had a paragraph in Maori with presumably a translation thereafter]
+    "http://www.wcl.govt.nz", {included: "http://kete.wcl.govt.nz" as wcl.govt.nz; 2/5 [first 3 misdetected: Tokelauan (American Samoa), Kiribati, Tongan], 0/3}
+!!  "http://punareo.co.nz", [waiata]

+    "https://rapuatearatika.education.govt.nz",
+    "http://tmmkkm.school.nz",
X    "https://www.components-mart.nz", [autotranslated product site]
+    "http://www.cs.waikato.ac.nz", [Te Taka's pages!]
+!!! "http://www.kupengahao.co.nz", [MRI language books and resources]
+    "https://www.hapuhauora.health.nz", [Smokefree site. The one isMRI page has a proper sentence.]
X    "https://www.lcds-display.nz", [autotranslated product site]
+    "http://cms.sunsmartschools.co.nz", [as sunsmartschools.co.nz, e.g. http://sunsmartschools.co.nz/kura/how-to-become.html]
+    "http://kuraproductions.co.nz",
+    "https://keepourmoneyclean.govt.nz", [1 page]

+!!  "http://www.tekura.school.nz",
+    "http://www.tkkmmokopuna.school.nz", [e.g. http://www.tkkmmokopuna.school.nz/newsletter_sets/newsletters/13-wahanga-4-wiki-5-he-puna-korero]
+    "http://hangaraumatihiko.tki.org.nz", [govt website. e.g. http://hangaraumatihiko.tki.org.nz/te-tupuranga-tangata-me-te-rorohiko/whakatupuranga-3-te-whakatau-tikanga-kawe-i-nga-mahi-whakaoti-hopanga/]
+    "http://www.pakanae.maori.nz"
    ],
    "numPagesInMRICount" : 4360,
    "numPagesContainingMRICount" : 7968
}


96 sites detected as having isMRI pages - 7 sites/subdomains already counted under another entry among the 96 = 89 sites.

89 - 2.5* product sites - 2 non-MRI sites (song listings, forms, etc.)
    *0.5 for the e-agent.nz site
= 84.5 sites in total that at least contain MRI; most have pages inMRI.
----------------------------

/* 1 */
{
    "_id" : "nz",
    "count" : 176.0,
    "domain" : [
!!   "http://pukekohe.directorybusiness.co.nz", 0/2, 0/2, isMRI = 0!!
     "http://maori.livingheritage.org.nz", 2/2 2/2
     "http://pukoro.co.nz", 2/2 0/2
     "http://www.rakaumanga.school.nz", 0/4 0/4
     "http://www.ngamanawainc.co.nz", 0/2 0/2
     "https://office.e-agent.nz",
     "https://www.components-mart.nz",
     "http://tmmkkm.school.nz",
     "http://www.rotoruanz.com",
     "http://www.huri-translations.pf",
     "https://admin.teara.govt.nz",
     "http://hangaraumatihiko.tki.org.nz",
     "https://sexualviolence.victimsinfo.govt.nz",
     "http://www.tekura.school.nz",
     "http://philipbeadle.co.nz",
     "http://www.cs.waikato.ac.nz",
     "https://www.hapuhauora.health.nz",
     "http://cms.sunsmartschools.co.nz",
     "https://keepourmoneyclean.govt.nz",
     "http://www.kura-porirua.school.nz",
     "http://waitarahistory.org.nz",
     "http://oilcrash.com",
     "http://videos.e-agent.nz",
     "https://manawatuheritage.pncc.govt.nz",
     "https://www.terakipaewhenua.school.nz",
     "http://dev.nzpcn.org.nz",
     "https://kotahimiriona.co.nz",
     "http://kurakokiri.maori.nz",
     "https://www.sporty.co.nz",
     "http://kaupare.co.nz",
     "http://ngatiporoukiponeke.org.nz",
     "https://www.takitimu.ac.nz",
     "http://www.tetaurawhiri.govt.nz",
     "http://www.waiata.maori.nz",
     "http://conference.tpwt.maori.nz",
     "http://ngatiwhakaue.iwi.nz",
     "http://www.nzpcn.org.nz",
     "http://www.ruralfind.co.nz",
     "https://www.dnc.org.nz",
     "https://www.puau.school.nz",
     "https://kaiiwicamp.nz",
     "https://www.terito.school.nz",
     "https://www.pinterest.nz",
     "https://e-ako-pangarau.nzmaths.co.nz",
     "http://givealittle.co.nz",
     "https://teaomaori.news",
     "https://www.korokikahukura.co.nz",
     "http://myfathersworld.net.nz",
     "http://www.firstworldwar.tki.org.nz",
     "https://www.ashtangatauranga.co.nz",
     "http://biketorqueyamaha.co.nz",
     "https://www.rereahu.maori.nz",
     "http://www.tewikiotereomaori.co.nz",
     "http://www.brettgraham.co.nz",
     "http://tewikiotereomaori.nz",
     "http://anglicanprayerbook.nz",
     "http://arataua.nz",
     "http://blog.teara.govt.nz",
     "http://www.otepoti.school.nz",
     "http://www.kmk.maori.nz",
     "http://www.eventcinemas.co.nz",
     "https://www.stats.govt.nz",
     "http://www.oag.govt.nz", 2/2 0/2
     "http://whatonga.school.nz",
     "http://www.tewhanake.maori.nz",
     "https://www.maoritelevision.com",
     "http://kuraaiwi.maori.nz",
     "http://kurataiao.tki.org.nz",
     "http://teaohou.natlib.govt.nz",
     "http://www.tetaumuturunanga.iwi.nz",
     "http://www.tasteofplenty.co.nz",
     "http://community.nzdl.org",
     "https://www.blushandbrows.nz",
     "https://register.tpota.org.nz",
     "https://cdn.tehiku.nz",
     "http://www.wcl.govt.nz",
     "http://www.jeremybaker.nz",
     "http://punareo.co.nz",
     "https://rapuatearatika.education.govt.nz",
     "http://www.kurakokiri.maori.nz",
     "https://www.cruisetourstauranga.co.nz",
     "https://sooty.nz",
     "http://rakaumanga.school.nz",
     "https://tiritiowaitangi.govt.nz",
     "http://www.tmoa.tki.org.nz",
     "http://www.w3vietnam.org.nz",
     "https://www.infinite-electronic.nz",
     "https://www.komako.org.nz",
     "http://nzpostcard.co.nz",
     "http://artizani.co.nz",
     "http://www.finlaysonpark.school.nz",
     "http://crimson.co.nz",
     "http://holyspirit.nz",
     "http://www.tkkmmokopuna.school.nz",
     "http://www.pakanae.maori.nz",
     "http://www.teipukarea.maori.nz",
     "http://archerpix.com",
     "https://2019.nethui.nz",
     "http://www.kupengahao.co.nz",
     "https://www.lcds-display.nz",
     "http://waiata.maori.nz",
     "http://kuraproductions.co.nz",
     "http://www.biketorqueyamaha.co.nz",
     "http://www.livingheritage.org.nz",
     "http://www.zoomin.co.nz",
     "http://rsnz.natlib.govt.nz",
     "http://otorohanga.directorybusiness.co.nz",
     "http://reoora.co.nz",
     "http://w3vietnam.org.nz",
     "https://rehuamarae.co.nz",
     "https://www.electionresults.org.nz",
     "https://www.ngamanawainc.co.nz",
     "https://www.rotorua-rafting.co.nz",
     "https://www.taitokerautrust.org.nz",
     "https://www.wingspan.co.nz",
     "http://www.kkmmaungarongo.co.nz",
     "http://kete.wcl.govt.nz",
     "http://www.heartland.co.nz",
     "http://www.electionresults.govt.nz",
     "https://www.tematawai.maori.nz",
     "http://hana.co.nz",
     "http://www.tereowrap.nz",
     "http://rurued.school.nz",
     "http://www.twtop.school.nz",
     "http://rexedra.gen.nz",
     "http://archive.stats.govt.nz",
     "https://liveresults.co.nz",
     "https://www.e-agent.nz",
     "http://tiritiowaitangi.govt.nz",
     "http://www.hrc.co.nz",
     "http://animations.tewhanake.maori.nz",
     "https://interactives.stuff.co.nz",
     "http://avonside.net",
     "http://www.methodist.org.nz",
     "https://www.tasteofplenty.co.nz",
     "http://www.maoriinvestments.co.nz",
     "https://m.wairarapatv.co.nz",
     "http://www.gans.co.nz",
     "https://ttw1.cwp.govt.nz",
     "http://ngarauhuia.ngatiapakiterato.iwi.nz",
     "https://www.tuiatematangi.ac.nz",
     "http://tetaurawhiri.govt.nz",
     "http://maori.tki.org.nz",
     "http://www.topomap.co.nz",
     "https://www.puhaandpakeha.co.nz",
     "https://haereheikaiako.co.nz",
     "https://paekupu.co.nz",
     "https://curriculumtool.education.govt.nz",
     "http://firstworldwar.tki.org.nz",
     "http://www.28maoribattalion.org.nz",
     "https://hepatakakupu.nz",
     "https://www.zenbu.co.nz",
     "http://www.matarikifestival.org.nz",
     "http://pukapuka.nz",
     "http://ngatipahauwera.co.nz", 2/2 2/2
     "http://southerntribes.co.nz",
     "https://player.vimeo.com",
     "http://tmoa.tki.org.nz",
     "http://www.writersfestival.co.nz",
     "http://talkingtothecan.com",
     "https://www.whanau-tahi.school.nz",
     "http://satellites.co.nz",
     "http://auturoa.nz",
     "http://www.tuwharetoa.iwi.nz",
     "http://kmpmusic.co.nz",
     "http://www.temarareo.org",
     "http://archive.electionresults.govt.nz",
     "http://kaiiwicamp.nz",
     "http://tehauora.org.nz",
     "http://temahurehure.maori.nz",
     "http://www.runanga.co.nz"
    ],
    "numPagesInMRICount" : 4360,
    "numPagesContainingMRICount" : 9641
}


----------------------------

The remainder: 80 NZ sites detected as containing no pages inMRI, but with a positive number of pages detected as containsMRI:

db.Websites.aggregate([
    {
        $match: {
            $and: [
                {numPagesContainingMRI: {$gt: 0}},
                {numPagesInMRI: {$eq: 0}},
                {$or: [{geoLocationCountryCode: "NZ"}, {domain: /\.nz/}]}
            ]
        }
    },
    { $unwind: "$geoLocationCountryCode" },
    {
        $group: {
            _id: "nz",
            count: { $sum: 1 },
            domain: { $addToSet: '$domain' },
            numPagesInMRICount: { $sum: '$numPagesInMRI' },
            numPagesContainingMRICount: { $sum: '$numPagesContainingMRI' }
        }
    },
    { $sort : { count : -1} }
]);


Find pages for testing with:
    db.getCollection('Webpages').find({URL:/ashtangatauranga\.co\.nz/, containsMRI: true, mriSentenceCount: {$gt: 0}})


/* 1 */
{
    "_id" : "nz",
    "count" : 80.0,
    "domain" : [
X    "http://www.zoomin.co.nz", [map site, so placenames]
X    "http://www.biketorqueyamaha.co.nz", [placenames] {includes "http://biketorqueyamaha.co.nz"}
X    "http://archerpix.com", [photo captions containing placenames]
X    "http://philipbeadle.co.nz", [art captions containing placenames]
X    "https://2019.nethui.nz", [Just MRI words in ENG sentences]
X    "http://crimson.co.nz", [address]
+    "http://holyspirit.nz", (e.g. https://holyspirit.nz/wp-content/uploads/02-25-2018-Newsletter.pdf)
X    "https://www.wingspan.co.nz", [1 page, 1 conjoined word, looks like placename]
X    "http://nzpostcard.co.nz", [postcards with placenames]
+    "https://www.ngamanawainc.co.nz", (e.g. https://www.ngamanawainc.co.nz/history/id/60) {includes "http://www.ngamanawainc.co.nz"}

+    "http://www.finlaysonpark.school.nz", [e.g. http://www.finlaysonpark.school.nz/58/pages/61-team-overview; Some Tongan language pages and Samoan pages]
X    "http://artizani.co.nz", [address]
+    "http://www.w3vietnam.org.nz", [e.g. http://www.w3vietnam.org.nz/w3%20mihi%20ki%20ka%20hoia.htm] {includes "http://w3vietnam.org.nz"}
X    "https://sooty.nz", [names, war death notices, place names]
X?   "http://rakaumanga.school.nz", [School has a longer Maori title name, but that's it] {includes "http://www.rakaumanga.school.nz"}
X    "http://www.rotoruanz.com", [e.g. https://www.rotoruanz.com/RNZ/media/Media-Library/SOI-Board-version-08-June-2018-final-published-on-website.pdf]
X    "https://www.cruisetourstauranga.co.nz", [English with Tauranga being MRI placename]
X    "http://www.jeremybaker.nz", [one word, Hokio]

X    "https://liveresults.co.nz", [canoe sports team names]
X    "http://rexedra.gen.nz", [ENG sentence with MRI words]
+    "https://www.takitimu.ac.nz", [school of performing arts, example MRI sentence at https://www.takitimu.ac.nz/about-us]
X    "http://www.electionresults.govt.nz", [placenames and misdetection of Return to Home Page] {includes https://www.electionresults.org.nz which seems to be an alias for the same} {includes "http://archive.electionresults.govt.nz"}
+    "https://kotahimiriona.co.nz", (e.g. https://kotahimiriona.co.nz/about-the-app/)
+    "https://rehuamarae.co.nz", (e.g. https://rehuamarae.co.nz/)
+    "http://reoora.co.nz", (e.g. https://reoora.co.nz/about/)

X    "http://otorohanga.directorybusiness.co.nz", [placenames]
X    "http://waitarahistory.org.nz", [placenames and misdetection of sentences of the form "31, No 262." as MRI]
+    "https://manawatuheritage.pncc.govt.nz", (e.g. https://manawatuheritage.pncc.govt.nz/about)
+    "http://rsnz.natlib.govt.nz", [e.g. http://rsnz.natlib.govt.nz/volume/rsnz_26/rsnz_26_00_004560.html] NOTE: site appears to use Greenstone
X    "https://www.rotorua-rafting.co.nz", [placenames]
+    "https://www.taitokerautrust.org.nz", (e.g. https://www.taitokerautrust.org.nz/)
+    "http://tewikiotereomaori.nz", (e.g. https://tewikiotereomaori.nz/about/)
+    "https://www.korokikahukura.co.nz", (e.g. https://www.korokikahukura.co.nz/ko-wai-m257tou.html about the connection to the Waikato River)

X    "https://www.puhaandpakeha.co.nz", [ENG sentences with MRI words]
X    "http://myfathersworld.net.nz", [placenames]
X    "https://www.ashtangatauranga.co.nz", [misdetection]
+    "https://www.pinterest.nz", (e.g. sentence "E tu Whare Rangi, e tu Whare Pataka, kia ora koutou nga Whare Tipuna o te Ao Maori." at https://www.pinterest.nz/amp/marylizw03/maori/)
+    "https://www.rereahu.maori.nz", (e.g. https://www.rereahu.maori.nz/uploads/72717/files/168860/Te_Maru_o_Rereahu_Newsletter_No_5_February_2011.pdf)
+    "http://givealittle.co.nz", (e.g. https://givealittle.co.nz/cause/rebuild-tapu-te-ranga-marae which contains greetings phrases and sentence "Nā te ringa tangata i hanga te whare Nā te tuarā o te whare i whakatipu i te tangata")
X    "http://www.gans.co.nz", [placenames]
+    "https://kaiiwicamp.nz", [placenames] {includes "http://kaiiwicamp.nz"}
+    "http://ngarauhuia.ngatiapakiterato.iwi.nz", (e.g. http://ngarauhuia.ngatiapakiterato.iwi.nz/pdf/Nga-Rau-Huia_Raumati_2015_Issue1.pdf)
+    "https://m.wairarapatv.co.nz", (e.g. https://m.wairarapatv.co.nz/archive/i/JGmJqCZCNK8/te-ara-whanui-kura-kaupapa-maori-o-nga-khanga-reo-o-te-awa-kairangi)

X    "http://www.methodist.org.nz", [ENG sentence with MRI words]
+    "http://avonside.net", (e.g. http://avonside.net/Nicky/indexmain.htm)
X    "http://www.ruralfind.co.nz", [placenames]
+    "http://www.maoriinvestments.co.nz", (e.g. Vision, Mission & Values tab of https://maoriinvestments.co.nz/about/organisation)
+    "http://conference.tpwt.maori.nz", (e.g. https://www.tpwt.maori.nz/te-mahe-matauranga/)
+    "https://www.puau.school.nz", (e.g. https://www.puau.school.nz/k%C4%81inga-home)
+?   "http://ngatiwhakaue.iwi.nz", (e.g. greetings at http://ngatiwhakaue.iwi.nz/)
X    "http://www.nzpcn.org.nz", [lots of misdetections and one sentence in English with MRI words, e.g. "Kaori is a Taonga to Maori".] {includes "http://dev.nzpcn.org.nz"}
+?   "https://interactives.stuff.co.nz", [2 x page mostly empty except for the title "TE WIKI O TE REO MĀORI Māori"]
+    "http://tehauora.org.nz", (e.g. http://tehauora.org.nz/ and proverb at bottom of http://tehauora.org.nz/about-us)

+    "http://temahurehure.maori.nz", (e.g. greeting http://temahurehure.maori.nz/site/file/temp/hirebook.pdf)
X    "http://pukekohe.directorybusiness.co.nz", [placenames]
+!!  "http://www.temarareo.org", (dictionary with sample sentences. e.g. http://www.temarareo.org/PAPAKUPU/dictionary-searchresults/A-2.htm)
X    "https://www.tasteofplenty.co.nz", [Tauranga placename and misdetection of "No one went away hungry"] {includes "http://www.tasteofplenty.co.nz"}

+    "http://www.tetaumuturunanga.iwi.nz", (e.g. http://www.tetaumuturunanga.iwi.nz/wp-content/uploads/2016/04/Malvern-Cultural-Narrative-Draft.pdf)


X    "https://www.blushandbrows.nz", [misdetection of "Makeup..."]
X    "http://talkingtothecan.com", [misdetection of things like "22, no." and mistaking ENG sentences with MRI words]

+?   "http://whatonga.school.nz", [school title]
+?   "https://player.vimeo.com", [Video titles contain MRI sentences and MRI titles interspersed in ENG sentence. Video's audio content may be in MRI]
+    "http://www.writersfestival.co.nz", (e.g. http://www.writersfestival.co.nz/news/Page1/urgently-relevant-novel-wins-countrys-richest-literary-award/)
+?   "http://southerntribes.co.nz", [Brazilian jiu jitsu site with a greeting in MRI on main page]
+    "http://www.kmk.maori.nz", (e.g. http://www.kmk.maori.nz/kmk-events)
+    "https://www.stats.govt.nz", (e.g. http://archive.stats.govt.nz/Census/2001-census-data/2001-census-pacific-profiles/cook-island-maori-people-in-new-zealand.aspx)
X    "http://www.eventcinemas.co.nz", [placenames and misdetection of ENG "Atura Hotels"]
X    "https://www.zenbu.co.nz" [misdetection and NZ school addresses]
    ],
    "numPagesInMRICount" : 0,
    "numPagesContainingMRICount" : 1673
}