/*

db.Websites.aggregate([
    {
        $match: {
            $and: [
                {numPagesInMRI: {$gt: 0}},
                {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]}
            ]
        }
    },
    { $unwind: "$geoLocationCountryCode" },
    {
        $group: {
            _id: "nz",
            count: { $sum: 1 },
            domain: { $addToSet: '$domain' },
            numPagesInMRICount: { $sum: '$numPagesInMRI' },
            numPagesContainingMRICount: { $sum: '$numPagesContainingMRI' }
        }
    },
    { $sort : { count : -1} }
]);

For sites originating in NZ or with a .nz TLD, none of the URLs are manually inspected and all URLs are accepted.
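The $or clause in the query above accepts a site when its domain matches /\.nz/. That pattern is unanchored, so it also matches hosts that merely contain ".nz" somewhere, for example community.nzdl.org. A minimal sketch of the difference follows. It assumes the domain field holds values such as "http://www.oag.govt.nz" with no path, which is an assumption about the data, and isNzTld is a hypothetical helper that is not part of the original queries. Sites accepted only through geoLocationCountryCode:"NZ" are unaffected either way.

```javascript
// Unanchored pattern, as used in the queries in these notes.
const unanchored = /\.nz/;

// Hypothetical stricter pattern: ".nz" must end the string,
// optionally followed by a port and/or one trailing slash.
const anchored = /\.nz(:\d+)?\/?$/;

function isNzTld(domain) {
  return anchored.test(domain);
}

const samples = [
  "http://www.oag.govt.nz",          // real .nz TLD
  "http://community.nzdl.org",       // contains ".nz" but the TLD is .org
  "https://www.maoritelevision.com", // no ".nz" at all
];

for (const d of samples) {
  console.log(d, "unanchored:", unanchored.test(d), "anchored:", isNzTld(d));
}
```

Swapping the pattern would change which sites the aggregation returns, so the counts recorded in these notes would no longer be directly comparable.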

For all countries but NZ, get the final-column results with:
    db.getCollection('Websites').find({domain:/coggle\.it/})
The URLs can then be checked with:
    db.getCollection('Webpages').find({URL: /coggle\.it/, isMRI: true})


NOTES:
1. DE:

"de","2.0","0+1","9+35 misdetected", http://www.cartogiraffe.com, https://www.cartogiraffe.com,
Ought to be 2+2 numPagesInMRICount and 9+2 numPagesContainingMRICount:
- Both cartogiraffe.com pages were identical and consisted mostly of MRI sentences, with only one name not being MRI. So isMRI should have been true for both pages.
- Only one of the 2 MRI translations of the Universal Declaration of Human Rights at http://www.udhr.de got downloaded. A total of 75 pages were downloaded, but more translated pages appeared to be on the website. Not sure why the crawl had a _SUCCESS file to indicate a completed download.
- Then http://www.udhr.de had 35-1 non-MRI language translations of the Universal Declaration of Human Rights where one or more sentences were misdetected as MRI. With the additional MRI page that didn't get downloaded, there should be 9+2 = 11 pages containing MRI.

So instead of
"de","2.0","1","44", http://www.cartogiraffe.com, https://www.cartogiraffe.com, http://www.udhr.de
the corrected row is:
"de","2.0","4","11", http://www.cartogiraffe.com, https://www.cartogiraffe.com, http://www.udhr.de


"au","3.0",7+0+1,83+1+3,https://www.kiwiproperty.com, https://infogram.com/te-marautanga-o-aotearoa-moe-pld-allocations-2012-1go502ygvn562jd,https://koreromaori.com

2. US:
aclhokiangarocks.blogspot.com contains at least one page with MRI paragraphs. See http://aclhokiangarocks.blogspot.com/feeds/posts/default under the section "Nga Tuhinga o tatou Tupuna".
Although this page was crawled by Nutch, the blog presents its contents in a complex way, so the text wasn't retrieved here. See also the dedicated page this text should have been on: http://aclhokiangarocks.blogspot.com/2012/05/nga-tuhinga-o-tatou-tupuna.html

"_id","siteCount","numPagesInMRICount","numPagesContainingMRICount","URLs of pages detected as inMRI"
"nz","176.0" containsMRI sites vs 96 isMRI sites,"4360","9641" in the 176 containsMRI sites vs 7968 in the 96 isMRI sites
"us","29.0",
1+2+0+0+4+166+0+39 +257+2+21+12+25+13+53+0+1+0+1+11 +32+37+4 +0+0+0 = 681,
31+2+2+20+58+166+3+91 +258+2+25+12+66+22+53+6+1+1+2+10 +58+54+6 +1+2+1 = 953,
anglicanhistory.org,unicode.org,static-promote.weebly.com,aclhokiangarocks.blogspot.com,bahaiprayers.net,biblehub.com,muhammad.com,godrules.net,m.biblepub.com, krassotkin.ru,gotquestions.org,
maorinews.com,maaori.com,kiaorahola.blogspot.com,kjohnsonnz.blogspot.com,pumanawawhangara.blogspot.com,dannykahei.tripod.com,burkekm001.tripod.com,tkkpipipaopao.blogspot.com, manateina.blogspot.com,
tatai09.blogspot.com,twttoa.com,tuhua2010.blogspot.com,
breaker.audio,drive.google.com/file/d/1NwuzafjddaP8gxI7O_Zapts5bM7mrtwn/preview,in.pinterest.com/pin/317363104978423418/
"au","2.0","8","86", https://www.kiwiproperty.com, https://koreromaori.com
"de","2.0","4","11", http://www.cartogiraffe.com, https://www.cartogiraffe.com, http://www.udhr.de
"dk","2.0","4","7", *.ngapuhitelevision.com, *.ngapuhiradio.com
"bg","1.0","2","2", http://anitra.net/activism/humanrights/UDHR/mbf_print.htm, http://anitra.net/activism/humanrights/UDHR/rrt_print.htm
"cz","1.0","0","1", http://www.henryklahola.nazory.cz/094.Maori.htm, http://henryklahola.nazory.cz/094.Maori.htm
"es","1.0","1","1", https://www.uv.es/~pla/red.net/intmaori.html
"fr","1.0","1","1", http://chantsdeluttes.free.fr/versionsinter/page%20maori.html
"ie","1.0","1","3", https://coggle.it/diagram/WSYB0mLA2QABD5BH/t/ko-au-ko-koe


--------------

https://www.statisticshowto.datasciencecentral.com/probability-and-statistics/find-sample-size/#CI1
https://stats.stackexchange.com/questions/207584/sample-size-choice-with-binary-outcome
https://www.statisticshowto.datasciencecentral.com/z-alpha2-za2/

N (NZ pages where isMRI comes out true) = 4360
solving for n, the sample size
confidence level = 90%
m, margin of error = 5%

From the "z alpha/2" table, for 90% confidence, we get a z alpha/2 value of 1.6449 (or 1.645).

Then the sample size we need is n = 1.6449^2 * 4360 / ( 1.6449^2 + (4 * 4359) * 0.05^2) = 255 (rounded up)


For N = 681,
the sample size is n = 1.6449^2 * 681 / ( 1.6449^2 + (4 * 680) * 0.05^2) = 194 (rounded up)


sample size for NZ: 255 (90% confidence with 5% margin of error, including a finite population correction factor)
sample size for US: 194

*/
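Both sample sizes in the notes above follow the same closed form, n = z^2 * N / (z^2 + 4 * (N - 1) * m^2), which is the usual finite-population-corrected sample size for a proportion with the conservative choice p = 0.5. A minimal sketch that reproduces the two figures is given here; sampleSize is a hypothetical helper name, while the z value 1.6449 and the 5% margin are taken from the notes.

```javascript
// Sample size for estimating a proportion in a finite population,
// using the worst-case proportion p = 0.5 (so p * (1 - p) = 1/4).
//   N: population size
//   z: z alpha/2 value for the chosen confidence level
//   m: margin of error as a fraction (0.05 for 5%)
function sampleSize(N, z, m) {
  const z2 = z * z;
  return Math.ceil((z2 * N) / (z2 + 4 * (N - 1) * m * m));
}

console.log(sampleSize(4360, 1.6449, 0.05)); // NZ isMRI pages -> 255
console.log(sampleSize(681, 1.6449, 0.05));  // US isMRI pages -> 194
```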


// To add column: "URLs of pages detected as inMRI"
"_id","siteCount containsMRI","numPagesInMRICount","numPagesContainingMRICount"
X"nz","176.0","4360","9641"
"nz", "166.0", "?", "?"
"us","29.0","681","953"
"au","2.0","8","86"
"de","2.0","4","11"
"dk","2.0","4","7"
"bg","1.0","2","2"
"cz","1.0","0","1"
"es","1.0","1","1"
"fr","1.0","1","1"
"ie","1.0","1","3"

Total sites containing MRI: 216
Total num sites detected as containing MRI: 868 [db.getCollection('Websites').find({numPagesContainingMRI: {$gt: 0}}).count()]
[of which 96 isMRI sites from NZ]
Total pages detected as being in MRI: 5062
Total pages detected as containing MRI sentences: 10706
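The three totals above (216 sites, 5062 pages, 10706 pages) can be rechecked by summing the table columns. The sketch below copies the rows from the table, using the "nz" row with 176 sites rather than the tentative 166 row; columnTotals is a hypothetical helper name.

```javascript
// Rows copied from the per-country table in these notes:
// [country, siteCount, numPagesInMRICount, numPagesContainingMRICount]
const rows = [
  ["nz", 176, 4360, 9641],
  ["us", 29, 681, 953],
  ["au", 2, 8, 86],
  ["de", 2, 4, 11],
  ["dk", 2, 4, 7],
  ["bg", 1, 2, 2],
  ["cz", 1, 0, 1],
  ["es", 1, 1, 1],
  ["fr", 1, 1, 1],
  ["ie", 1, 1, 3],
];

// Sum each numeric column across all countries.
function columnTotals(table) {
  return table.reduce(
    (acc, [, sites, inMRI, containsMRI]) => ({
      sites: acc.sites + sites,
      inMRI: acc.inMRI + inMRI,
      containsMRI: acc.containsMRI + containsMRI,
    }),
    { sites: 0, inMRI: 0, containsMRI: 0 }
  );
}

const totals = columnTotals(rows);
console.log(totals); // { sites: 216, inMRI: 5062, containsMRI: 10706 }
```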



NZ - sample 255 pages from:
/*
db.Websites.aggregate([
    {
        $match: {
            $and: [
                {numPagesContainingMRI: {$gt: 0}},
                {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]}
            ]
        }
    },
    { $unwind: "$geoLocationCountryCode" },
    {
        $group: {
            _id: "nz",
            count: { $sum: 1 },
            domain: { $addToSet: '$domain' },
            numPagesInMRICount: { $sum: '$numPagesInMRI' },
            numPagesContainingMRICount: { $sum: '$numPagesContainingMRI' }
        }
    },
    { $sort : { count : -1} }
]);


OR is this better (only numPagesInMRI):

db.Websites.aggregate([
    {
        $match: {
            $and: [
                {numPagesInMRI: {$gt: 0}},
                {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]}
            ]
        }
    },
    { $unwind: "$geoLocationCountryCode" },
    {
        $group: {
            _id: "nz",
            count: { $sum: 1 },
            domain: { $addToSet: '$domain' },
            numPagesInMRICount: { $sum: '$numPagesInMRI' },
            numPagesContainingMRICount: { $sum: '$numPagesContainingMRI' }
        }
    },
    { $sort : { count : -1} }
]);
*/

num NZ sites with > 0 isMRI pages = 96
Total numPagesInMRI in NZ sites = 4360
Total numPagesContainingMRI in NZ sites = 7968

The results give a list of the domains that matched: 171 nz domains, though it should be 176? -1

Copy each domain (up to 255 of them) and look for the first 1 or 2 pages max that match isMRI:

1. db.getCollection('Webpages').find({URL:/pukekohe\.directorybusiness\.co\.nz/, isMRI: true}) - check that it returns a positive number of pages in MRI and check the first 1-2 pages to make sure they are indeed in MRI. Note down the ratio of MRI finds, e.g. 2/2.

2. Find those pages that are containsMRI but not isMRI and check whether they do indeed contain sentences in MRI. Note down the ratio for the first 2 pages.
    db.getCollection('Webpages').find({URL:/maori\.livingheritage\.org\.nz/, isMRI: false, containsMRI: true})
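The two steps above check the first 1-2 pages returned for each domain, which is not a random pick. If a random pick per domain were preferred, one option might be MongoDB's $sample aggregation stage. The sketch below only builds the pipeline as a plain array; escapeRegex and samplePagesPipeline are hypothetical helpers, and using them would be a change from the procedure that produced the ratios recorded in these notes.

```javascript
// Escape regex metacharacters so "." in a host name matches only a literal dot.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build an aggregation pipeline that picks `size` random pages of one host.
// `filter` carries the flag conditions, e.g. { isMRI: true }.
function samplePagesPipeline(host, filter, size) {
  return [
    { $match: { URL: new RegExp(escapeRegex(host)), ...filter } },
    { $sample: { size: size } },
  ];
}

const pipeline = samplePagesPipeline(
  "maori.livingheritage.org.nz",
  { isMRI: false, containsMRI: true },
  2
);
console.log(pipeline[0].$match.URL.source);

// In the mongo shell the pipeline would then be passed to:
//   db.getCollection('Webpages').aggregate(pipeline)
```

Escaping the host also avoids the unescaped "." in a hand-written pattern matching any character.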


First column: n pages that are in MRI / n sampled isMRI pages
Second column: n pages that do contain MRI / n sampled pages that are not isMRI yet are containsMRI
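The two ratio columns described above can be added up mechanically for entries that follow the plain "a/b c/d" form. The sketch below is a rough aid under that assumption: parseRatios and tally are hypothetical helpers, quoted strings and [bracketed] remarks are dropped before matching, and irregular entries (for example "0-4/4?", "1/total 3", or ratios recorded inside {braces}) would still need handling by hand.

```javascript
// Extract the first two "hits/sampled" ratios from one annotated line.
function parseRatios(line) {
  // Drop the quoted domain and any [bracketed] remark before matching.
  const notes = line.replace(/"[^"]*"/g, "").replace(/\[[^\]]*\]/g, "");
  const found = [...notes.matchAll(/(\d+)\/(\d+)/g)].map((m) => ({
    hits: Number(m[1]),
    sampled: Number(m[2]),
  }));
  return { isMRI: found[0] || null, containsMRI: found[1] || null };
}

// Sum hits and sampled counts per column over all lines.
function tally(lines) {
  const totals = {
    isMRI: { hits: 0, sampled: 0 },
    containsMRI: { hits: 0, sampled: 0 },
  };
  for (const line of lines) {
    const ratios = parseRatios(line);
    for (const column of ["isMRI", "containsMRI"]) {
      if (ratios[column]) {
        totals[column].hits += ratios[column].hits;
        totals[column].sampled += ratios[column].sampled;
      }
    }
  }
  return totals;
}

// Three entries copied from the annotated list in these notes.
const example = [
  '"http://www.teipukarea.maori.nz", 3/3 1/3',
  '"http://ngatipahauwera.co.nz", 2/2, 2/2',
  '"http://www.oag.govt.nz", 2/2 0/2',
];
console.log(tally(example));
```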

/* 1 */
{
    "_id" : "nz",
    "count" : 96.0,
    "domain" : [
        "http://www.teipukarea.maori.nz", 3/3 1/3
        "http://ngatipahauwera.co.nz", 2/2, 2/2
        "http://www.oag.govt.nz", 2/2 0/2
        "https://sexualviolence.victimsinfo.govt.nz", 3/3 0/3
        "http://tmoa.tki.org.nz", 3/3 3/3
        "http://www.tewhanake.maori.nz", 3/3 2/3
        "http://www.matarikifestival.org.nz", 4/4 0/3
        "http://www.otepoti.school.nz", 3/3 0/4
!!      "https://www.maoritelevision.com", 3/4, 0 [no containsMRI outside isMRI pages]
        "http://pukapuka.nz", 3/3 1/4 [lorem ipsum used on first 3 pages]
        "http://community.nzdl.org", 3/3 0/3 [containsMRI has detected Te Taka Keegan as MRI sentence]
X!!     "http://kmpmusic.co.nz", 0-4/4? [but CD listing of some MRI album and song titles] 0 [no other pages containsMRI]
        "http://maori.livingheritage.org.nz", 2/2 2/2 {includes: http://www.livingheritage.org.nz}
        "http://pukoro.co.nz", 2/2 0/2
X       "https://register.tpota.org.nz", 0/1 [form] 0/2
+       "https://cdn.tehiku.nz" => DOMAIN: "tehiku.nz", 0/4, 1/3 [but audio content may be in MRI] But there are pages containing MRI to be found by non-random sampling of tehiku.nz, e.g. https://tehiku.nz/te-hiku-radio/te-tangihanga-ki-a-erima-henare/ contains MRI sentences
!!      "http://www.runanga.co.nz", 3/3 0 [no containsMRI outside isMRI pages]
!       "http://kuraaiwi.maori.nz", 2/4 [navigation only downloaded. But site content checked] 2/3
        "http://kurataiao.tki.org.nz", 3/3, 1/total 3

!!      "http://satellites.co.nz", 3/3 [kpop], 0 [no containsMRI outside isMRI pages]
        "http://teaohou.natlib.govt.nz", 4/4, 2/4
        "http://www.tuwharetoa.iwi.nz", 2/3 0/3
+       "http://auturoa.nz", 0/4 0/3 [lots of MRI terms among English] - COMMUNITY (But there are pages inMRI to be found by non-random sampling, e.g. http://auturoa.nz/KarakiaMoKuaToRangiTeRaa.html)
        "https://www.terito.school.nz", 3/3, 0/2 total
        "https://ttw1.cwp.govt.nz", 3/3 3/3
        "https://www.whanau-tahi.school.nz", 4/4, 1/2 total
        "https://e-ako-pangarau.nzmaths.co.nz", 3/3 total, 1/1 total
        "https://teaomaori.news", 3/3, 0/1 total
        "http://tetaurawhiri.govt.nz", 3/3 3/3 [Māori Language Commission site]
        "https://www.tuiatematangi.ac.nz", 4/4 3/3
        "http://animations.tewhanake.maori.nz", 3/3 3/3
!!      "https://www.dnc.org.nz", 1/1 total, 0 [no containsMRI outside isMRI pages]
!!      "http://firstworldwar.tki.org.nz", 3/3, 0 [no containsMRI outside isMRI pages]
        "http://www.28maoribattalion.org.nz", 3/3, 1/3
        "http://www.tewikiotereomaori.co.nz", 1/1 total, 3/3
        "http://www.brettgraham.co.nz", 1/1 total, 0/3
!!      "https://hepatakakupu.nz", 3/3, 0 [no containsMRI outside isMRI pages]

        "http://anglicanprayerbook.nz", 3/3 3/3
        "http://arataua.nz", 4/4, 2/3
        "http://maori.tki.org.nz", 3/3 3/3
DONE (with/out www): "http://www.firstworldwar.tki.org.nz",
X       "http://www.topomap.co.nz", 0/2 [all placenames], 0 [no containsMRI outside isMRI pages]
        "https://paekupu.co.nz", 4/4, 0 [no containsMRI outside isMRI pages]
        "https://haereheikaiako.co.nz", 1/1, 0 [no containsMRI outside isMRI pages]
        "https://curriculumtool.education.govt.nz", 4/4, 3/3
        "http://kurakokiri.maori.nz", 3/3, 3/3 [same nav menus on each page] {includes: "http://www.kurakokiri.maori.nz"}
        "http://www.kkmmaungarongo.co.nz", 3/3, 3/3
        "http://www.heartland.co.nz", 3/3, 1/1 total
        "http://oilcrash.com", 2/2 total, 0/3
        "http://www.kura-porirua.school.nz", 4/4, 2/3
        "https://www.sporty.co.nz", 3/3, 0 [no containsMRI outside isMRI pages]
        "https://www.tematawai.maori.nz", 3/3, 3/3

+       "https://www.terakipaewhenua.school.nz",
+       "http://www.tetaurawhiri.govt.nz",
+       "http://archive.stats.govt.nz", (1 page isMRI)
+       "http://tiritiowaitangi.govt.nz",
+!!     "http://www.waiata.maori.nz", {includes: "http://waiata.maori.nz"}
+       "http://hana.co.nz", [crawled version of page contains MRI sentences, but current page is labelled for mobiles whereas in browser it just has a picture]
+       "http://kaupare.co.nz",
+       "http://www.tereowrap.nz",
?X      "https://www.e-agent.nz", [autotranslated? SEO related site in NZ, Chinese, English and MRI] {includes: "https://office.e-agent.nz"}
        { included: "http://videos.e-agent.nz", [AT: e-agent.nz] 3/3, 3/3 [repeated nav] }
+       "http://www.hrc.co.nz",
+       "http://ngatiporoukiponeke.org.nz",

+       "http://rurued.school.nz",
+       "http://www.twtop.school.nz",
X       "https://www.infinite-electronic.nz", [autotranslated product site]
+!!     "http://www.huri-translations.pf",
+       "https://admin.teara.govt.nz", {included: "http://blog.teara.govt.nz", 3/3, 0/3 [AS: teara.govt.nz]}
+!!     "https://tiritiowaitangi.govt.nz",
+       "http://www.tmoa.tki.org.nz",
+       "https://www.komako.org.nz", [no longer available, but crawled page had a paragraph in Maori with presumably a translation thereafter]
+       "http://www.wcl.govt.nz", {included: "http://kete.wcl.govt.nz" as wcl.govt.nz; 2/5 [first 3 misdetected: Tokelauan (American Samoa), Kiribati, Tongan], 0/3}
+!!     "http://punareo.co.nz", [waiata]

+       "https://rapuatearatika.education.govt.nz",
+       "http://tmmkkm.school.nz",
X       "https://www.components-mart.nz", [autotranslated product site]
+       "http://www.cs.waikato.ac.nz", [Te Taka's pages!]
+!!!    "http://www.kupengahao.co.nz", [MRI language books and resources]
+       "https://www.hapuhauora.health.nz", [Smokefree site. The one isMRI page has a proper sentence.]
X       "https://www.lcds-display.nz", [autotranslated product site]
+       "http://cms.sunsmartschools.co.nz", [as sunsmartschools.co.nz, e.g. http://sunsmartschools.co.nz/kura/how-to-become.html]
+       "http://kuraproductions.co.nz",
+       "https://keepourmoneyclean.govt.nz", [1 page]

+!!     "http://www.tekura.school.nz",
+       "http://www.tkkmmokopuna.school.nz", [e.g. http://www.tkkmmokopuna.school.nz/newsletter_sets/newsletters/13-wahanga-4-wiki-5-he-puna-korero]
+       "http://hangaraumatihiko.tki.org.nz", [govt website. e.g. http://hangaraumatihiko.tki.org.nz/te-tupuranga-tangata-me-te-rorohiko/whakatupuranga-3-te-whakatau-tikanga-kawe-i-nga-mahi-whakaoti-hopanga/]
+       "http://www.pakanae.maori.nz"
    ],
    "numPagesInMRICount" : 4360,
    "numPagesContainingMRICount" : 7968
}


96 sites detected as having isMRI pages - 7 sites/subdomains already counted among the 96 = 89 sites.

- 2.5* product sites - 2 non-MRI sites with song listings, forms etc.
(* 0.5 counted for the e-agent.nz site)
= 84.5 sites in total that at least contain MRI; most have pages inMRI.
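The adjustment for sites "already included" was made by hand from the {includes: ...} annotations. Part of it, namely the http/https and www/non-www variants of the same host, can be found mechanically. The sketch below assumes a runtime with the WHATWG URL class (such as Node.js); siteKey and groupBySite are hypothetical helpers. It does not merge distinct subdomains (for example cdn.tehiku.nz into tehiku.nz), which the notes treat case by case.

```javascript
// Reduce a site URL to a comparison key: lower-cased host without "www.".
function siteKey(url) {
  const host = new URL(url).hostname.toLowerCase();
  return host.replace(/^www\./, "");
}

// Group URLs whose keys are equal, preserving first-seen order.
function groupBySite(urls) {
  const groups = new Map();
  for (const url of urls) {
    const key = siteKey(url);
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(url);
  }
  return groups;
}

// A few domains copied from the lists in these notes.
const groups = groupBySite([
  "http://kurakokiri.maori.nz",
  "http://www.kurakokiri.maori.nz",
  "http://tiritiowaitangi.govt.nz",
  "https://tiritiowaitangi.govt.nz",
  "https://cdn.tehiku.nz",
]);

for (const [key, members] of groups) {
  console.log(key, members.length);
}
```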
----------------------------

/* 1 */
{
    "_id" : "nz",
    "count" : 176.0,
    "domain" : [
!!      "http://pukekohe.directorybusiness.co.nz", 0/2, 0/2, isMRI = 0!!
        "http://maori.livingheritage.org.nz", 2/2 2/2
        "http://pukoro.co.nz", 2/2 0/2
        "http://www.rakaumanga.school.nz", 0/4 0/4
        "http://www.ngamanawainc.co.nz", 0/2 0/2
        "https://office.e-agent.nz",
        "https://www.components-mart.nz",
        "http://tmmkkm.school.nz",
        "http://www.rotoruanz.com",
        "http://www.huri-translations.pf",
        "https://admin.teara.govt.nz",
        "http://hangaraumatihiko.tki.org.nz",
        "https://sexualviolence.victimsinfo.govt.nz",
        "http://www.tekura.school.nz",
        "http://philipbeadle.co.nz",
        "http://www.cs.waikato.ac.nz",
        "https://www.hapuhauora.health.nz",
        "http://cms.sunsmartschools.co.nz",
        "https://keepourmoneyclean.govt.nz",
        "http://www.kura-porirua.school.nz",
        "http://waitarahistory.org.nz",
        "http://oilcrash.com",
        "http://videos.e-agent.nz",
        "https://manawatuheritage.pncc.govt.nz",
        "https://www.terakipaewhenua.school.nz",
        "http://dev.nzpcn.org.nz",
        "https://kotahimiriona.co.nz",
        "http://kurakokiri.maori.nz",
        "https://www.sporty.co.nz",
        "http://kaupare.co.nz",
        "http://ngatiporoukiponeke.org.nz",
        "https://www.takitimu.ac.nz",
        "http://www.tetaurawhiri.govt.nz",
        "http://www.waiata.maori.nz",
        "http://conference.tpwt.maori.nz",
        "http://ngatiwhakaue.iwi.nz",
        "http://www.nzpcn.org.nz",
        "http://www.ruralfind.co.nz",
        "https://www.dnc.org.nz",
        "https://www.puau.school.nz",
        "https://kaiiwicamp.nz",
        "https://www.terito.school.nz",
        "https://www.pinterest.nz",
        "https://e-ako-pangarau.nzmaths.co.nz",
        "http://givealittle.co.nz",
        "https://teaomaori.news",
        "https://www.korokikahukura.co.nz",
        "http://myfathersworld.net.nz",
        "http://www.firstworldwar.tki.org.nz",
        "https://www.ashtangatauranga.co.nz",
        "http://biketorqueyamaha.co.nz",
        "https://www.rereahu.maori.nz",
        "http://www.tewikiotereomaori.co.nz",
        "http://www.brettgraham.co.nz",
        "http://tewikiotereomaori.nz",
        "http://anglicanprayerbook.nz",
        "http://arataua.nz",
        "http://blog.teara.govt.nz",
        "http://www.otepoti.school.nz",
        "http://www.kmk.maori.nz",
        "http://www.eventcinemas.co.nz",
        "https://www.stats.govt.nz",
        "http://www.oag.govt.nz", 2/2 0/2
        "http://whatonga.school.nz",
        "http://www.tewhanake.maori.nz",
        "https://www.maoritelevision.com",
        "http://kuraaiwi.maori.nz",
        "http://kurataiao.tki.org.nz",
        "http://teaohou.natlib.govt.nz",
        "http://www.tetaumuturunanga.iwi.nz",
        "http://www.tasteofplenty.co.nz",
        "http://community.nzdl.org",
        "https://www.blushandbrows.nz",
        "https://register.tpota.org.nz",
        "https://cdn.tehiku.nz",
        "http://www.wcl.govt.nz",
        "http://www.jeremybaker.nz",
        "http://punareo.co.nz",
        "https://rapuatearatika.education.govt.nz",
        "http://www.kurakokiri.maori.nz",
        "https://www.cruisetourstauranga.co.nz",
        "https://sooty.nz",
        "http://rakaumanga.school.nz",
        "https://tiritiowaitangi.govt.nz",
        "http://www.tmoa.tki.org.nz",
        "http://www.w3vietnam.org.nz",
        "https://www.infinite-electronic.nz",
        "https://www.komako.org.nz",
        "http://nzpostcard.co.nz",
        "http://artizani.co.nz",
        "http://www.finlaysonpark.school.nz",
        "http://crimson.co.nz",
        "http://holyspirit.nz",
        "http://www.tkkmmokopuna.school.nz",
        "http://www.pakanae.maori.nz",
        "http://www.teipukarea.maori.nz",
        "http://archerpix.com",
        "https://2019.nethui.nz",
        "http://www.kupengahao.co.nz",
        "https://www.lcds-display.nz",
        "http://waiata.maori.nz",
        "http://kuraproductions.co.nz",
        "http://www.biketorqueyamaha.co.nz",
        "http://www.livingheritage.org.nz",
        "http://www.zoomin.co.nz",
        "http://rsnz.natlib.govt.nz",
        "http://otorohanga.directorybusiness.co.nz",
        "http://reoora.co.nz",
        "http://w3vietnam.org.nz",
        "https://rehuamarae.co.nz",
        "https://www.electionresults.org.nz",
        "https://www.ngamanawainc.co.nz",
        "https://www.rotorua-rafting.co.nz",
        "https://www.taitokerautrust.org.nz",
        "https://www.wingspan.co.nz",
        "http://www.kkmmaungarongo.co.nz",
        "http://kete.wcl.govt.nz",
        "http://www.heartland.co.nz",
        "http://www.electionresults.govt.nz",
        "https://www.tematawai.maori.nz",
        "http://hana.co.nz",
        "http://www.tereowrap.nz",
        "http://rurued.school.nz",
        "http://www.twtop.school.nz",
        "http://rexedra.gen.nz",
        "http://archive.stats.govt.nz",
        "https://liveresults.co.nz",
        "https://www.e-agent.nz",
        "http://tiritiowaitangi.govt.nz",
        "http://www.hrc.co.nz",
        "http://animations.tewhanake.maori.nz",
        "https://interactives.stuff.co.nz",
        "http://avonside.net",
        "http://www.methodist.org.nz",
        "https://www.tasteofplenty.co.nz",
        "http://www.maoriinvestments.co.nz",
        "https://m.wairarapatv.co.nz",
        "http://www.gans.co.nz",
        "https://ttw1.cwp.govt.nz",
        "http://ngarauhuia.ngatiapakiterato.iwi.nz",
        "https://www.tuiatematangi.ac.nz",
        "http://tetaurawhiri.govt.nz",
        "http://maori.tki.org.nz",
        "http://www.topomap.co.nz",
        "https://www.puhaandpakeha.co.nz",
        "https://haereheikaiako.co.nz",
        "https://paekupu.co.nz",
        "https://curriculumtool.education.govt.nz",
        "http://firstworldwar.tki.org.nz",
        "http://www.28maoribattalion.org.nz",
        "https://hepatakakupu.nz",
        "https://www.zenbu.co.nz",
        "http://www.matarikifestival.org.nz",
        "http://pukapuka.nz",
        "http://ngatipahauwera.co.nz", 2/2 2/2
        "http://southerntribes.co.nz",
        "https://player.vimeo.com",
        "http://tmoa.tki.org.nz",
        "http://www.writersfestival.co.nz",
        "http://talkingtothecan.com",
        "https://www.whanau-tahi.school.nz",
        "http://satellites.co.nz",
        "http://auturoa.nz",
        "http://www.tuwharetoa.iwi.nz",
        "http://kmpmusic.co.nz",
        "http://www.temarareo.org",
        "http://archive.electionresults.govt.nz",
        "http://kaiiwicamp.nz",
        "http://tehauora.org.nz",
        "http://temahurehure.maori.nz",
        "http://www.runanga.co.nz"
    ],
    "numPagesInMRICount" : 4360,
    "numPagesContainingMRICount" : 9641
}


----------------------------

The remainder: 80 NZ sites detected as containing no pages inMRI, but with a positive number of pages detected as containsMRI:

db.Websites.aggregate([
    {
        $match: {
            $and: [
                {numPagesContainingMRI: {$gt: 0}},
                {numPagesInMRI: {$eq: 0}},
                {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]}
            ]
        }
    },
    { $unwind: "$geoLocationCountryCode" },
    {
        $group: {
            _id: "nz",
            count: { $sum: 1 },
            domain: { $addToSet: '$domain' },
            numPagesInMRICount: { $sum: '$numPagesInMRI' },
            numPagesContainingMRICount: { $sum: '$numPagesContainingMRI' }
        }
    },
    { $sort : { count : -1} }
]);


Find pages for testing with:
    db.getCollection('Webpages').find({URL:/ashtangatauranga\.co\.nz/, containsMRI: true, mriSentenceCount: {$gt: 0}})


/* 1 */
{
    "_id" : "nz",
    "count" : 80.0,
    "domain" : [
X       "http://www.zoomin.co.nz", [map site, so placenames]
X       "http://www.biketorqueyamaha.co.nz", [placenames] {includes "http://biketorqueyamaha.co.nz"}
X       "http://archerpix.com", [photo captions containing placenames]
X       "http://philipbeadle.co.nz", [art captions containing placenames]
X       "https://2019.nethui.nz", [Just MRI words in ENG sentences]
X       "http://crimson.co.nz", [address]
+       "http://holyspirit.nz", (e.g. https://holyspirit.nz/wp-content/uploads/02-25-2018-Newsletter.pdf)
X       "https://www.wingspan.co.nz", [1 page, 1 conjoined word, looks like placename]
X       "http://nzpostcard.co.nz", [postcards with placenames]
+       "https://www.ngamanawainc.co.nz", (e.g. https://www.ngamanawainc.co.nz/history/id/60) {includes "http://www.ngamanawainc.co.nz"}

+       "http://www.finlaysonpark.school.nz", [e.g. http://www.finlaysonpark.school.nz/58/pages/61-team-overview; Some Tongan language pages and Samoan pages]
X       "http://artizani.co.nz", [address]
+       "http://www.w3vietnam.org.nz", [e.g. http://www.w3vietnam.org.nz/w3%20mihi%20ki%20ka%20hoia.htm] {includes "http://w3vietnam.org.nz"}
X       "https://sooty.nz", [names, war death notices, place names]
X?      "http://rakaumanga.school.nz", [School has a longer Maori title name, but that's it] {includes "http://www.rakaumanga.school.nz"}
X       "http://www.rotoruanz.com", [e.g. https://www.rotoruanz.com/RNZ/media/Media-Library/SOI-Board-version-08-June-2018-final-published-on-website.pdf]
X       "https://www.cruisetourstauranga.co.nz", [English with Tauranga being MRI placename]
X       "http://www.jeremybaker.nz", [one word, HOkio]

X       "https://liveresults.co.nz", [canoe sports team names]
X       "http://rexedra.gen.nz", [ENG sentence with MRI words]
+       "https://www.takitimu.ac.nz", [school of performing arts, example MRI sentence at https://www.takitimu.ac.nz/about-us]
X       "http://www.electionresults.govt.nz", [placenames and misdetection of "Return to Home Page"] {includes https://www.electionresults.org.nz which seems to be an alias for the same} {includes "http://archive.electionresults.govt.nz"}
+       "https://kotahimiriona.co.nz", (e.g. https://kotahimiriona.co.nz/about-the-app/)
+       "https://rehuamarae.co.nz", (e.g. https://rehuamarae.co.nz/)
+       "http://reoora.co.nz", (e.g. https://reoora.co.nz/about/)

X       "http://otorohanga.directorybusiness.co.nz", [placenames]
X       "http://waitarahistory.org.nz", [placenames and misdetection of sentences of the form "31, No 262." as MRI]
+       "https://manawatuheritage.pncc.govt.nz", (e.g. https://manawatuheritage.pncc.govt.nz/about)
+       "http://rsnz.natlib.govt.nz", [e.g. http://rsnz.natlib.govt.nz/volume/rsnz_26/rsnz_26_00_004560.html] NOTE: site appears to use Greenstone
X       "https://www.rotorua-rafting.co.nz", [placenames]
+       "https://www.taitokerautrust.org.nz", (e.g. https://www.taitokerautrust.org.nz/)
+       "http://tewikiotereomaori.nz", (e.g. https://tewikiotereomaori.nz/about/)
+       "https://www.korokikahukura.co.nz", (e.g. https://www.korokikahukura.co.nz/ko-wai-m257tou.html about the connection to the Waikato River)

X       "https://www.puhaandpakeha.co.nz", [ENG sentences with MRI words]
X       "http://myfathersworld.net.nz", [placenames]
X       "https://www.ashtangatauranga.co.nz", [misdetection]
+       "https://www.pinterest.nz", (e.g. sentence "E tu Whare Rangi, e tu Whare Pataka, kia ora koutou nga Whare Tipuna o te Ao Maori." at https://www.pinterest.nz/amp/marylizw03/maori/)
+       "https://www.rereahu.maori.nz", (e.g. https://www.rereahu.maori.nz/uploads/72717/files/168860/Te_Maru_o_Rereahu_Newsletter_No_5_February_2011.pdf)
+       "http://givealittle.co.nz", (e.g. https://givealittle.co.nz/cause/rebuild-tapu-te-ranga-marae which contains greeting phrases and the sentence "Nā te ringa tangata i hanga te whare Nā te tuarā o te whare i whakatipu i te tangata")
X       "http://www.gans.co.nz", [placenames]
+       "https://kaiiwicamp.nz", [placenames] {includes "http://kaiiwicamp.nz"}
+       "http://ngarauhuia.ngatiapakiterato.iwi.nz", (e.g. http://ngarauhuia.ngatiapakiterato.iwi.nz/pdf/Nga-Rau-Huia_Raumati_2015_Issue1.pdf)
+       "https://m.wairarapatv.co.nz", (e.g. https://m.wairarapatv.co.nz/archive/i/JGmJqCZCNK8/te-ara-whanui-kura-kaupapa-maori-o-nga-khanga-reo-o-te-awa-kairangi)

X       "http://www.methodist.org.nz", [ENG sentence with MRI words]
+       "http://avonside.net", (e.g. http://avonside.net/Nicky/indexmain.htm)
X       "http://www.ruralfind.co.nz", [placenames]
+       "http://www.maoriinvestments.co.nz", (e.g. Vision, Mission & Values tab of https://maoriinvestments.co.nz/about/organisation)
+       "http://conference.tpwt.maori.nz", (e.g. https://www.tpwt.maori.nz/te-mahe-matauranga/)
+       "https://www.puau.school.nz", (e.g. https://www.puau.school.nz/k%C4%81inga-home)
+?      "http://ngatiwhakaue.iwi.nz", (e.g. greetings at http://ngatiwhakaue.iwi.nz/)
X       "http://www.nzpcn.org.nz", [lots of misdetections and one sentence in English with MRI words, e.g. "Kaori is a Taonga to Maori".] {includes "http://dev.nzpcn.org.nz"}
+?      "https://interactives.stuff.co.nz", [2 x page mostly empty except for the title "TE WIKI O TE REO MĀORI Māori"]
+       "http://tehauora.org.nz", (e.g. http://tehauora.org.nz/ and proverb at bottom of http://tehauora.org.nz/about-us)

+       "http://temahurehure.maori.nz", (e.g. greeting http://temahurehure.maori.nz/site/file/temp/hirebook.pdf)
X       "http://pukekohe.directorybusiness.co.nz", [placenames]
+!!     "http://www.temarareo.org", (dictionary with sample sentences. e.g. http://www.temarareo.org/PAPAKUPU/dictionary-searchresults/A-2.htm)
X       "https://www.tasteofplenty.co.nz", [Tauranga placename and misdetection of "No one went away hungry"] {includes "http://www.tasteofplenty.co.nz"}

+       "http://www.tetaumuturunanga.iwi.nz", (e.g. http://www.tetaumuturunanga.iwi.nz/wp-content/uploads/2016/04/Malvern-Cultural-Narrative-Draft.pdf)


X       "https://www.blushandbrows.nz", [misdetection of "Makeup..."]
X       "http://talkingtothecan.com", [misdetection of things like "22, no." and mistaking ENG sentences with MRI words]

+?      "http://whatonga.school.nz", [school title]
+?      "https://player.vimeo.com", [Video titles contain MRI sentences and MRI titles interspersed in ENG sentence. Video's audio content may be in MRI]
+       "http://www.writersfestival.co.nz", (e.g. http://www.writersfestival.co.nz/news/Page1/urgently-relevant-novel-wins-countrys-richest-literary-award/)
+?      "http://southerntribes.co.nz", [Brazilian jiu-jitsu site with a greeting in MRI on main page]
+       "http://www.kmk.maori.nz", (e.g. http://www.kmk.maori.nz/kmk-events)
+       "https://www.stats.govt.nz", (e.g. http://archive.stats.govt.nz/Census/2001-census-data/2001-census-pacific-profiles/cook-island-maori-people-in-new-zealand.aspx)
X       "http://www.eventcinemas.co.nz", [placenames and misdetection of ENG "Atura Hotels"]
X       "https://www.zenbu.co.nz" [misdetection and NZ school addresses]
    ],
    "numPagesInMRICount" : 0,
    "numPagesContainingMRICount" : 1673
}