source: other-projects/maori-lang-detection/mongodb-data/6table_nonProductSites1_manualShortlist.json@ 33872

Last change on this file since 33872 was 33872, checked in by ak19, 4 years ago
  1. Added the file containing the 255 random NZ page URLs to sample. 2. Minor updates to 2 existing counts files. 3. Recorded isMRI aggregate command used for selecting NZ domains to sample from - for NZ sites did not use containsMRI to generate samples.
File size: 19.7 KB
/*

db.Websites.aggregate([
    {
        $match: {
            $and: [
                {numPagesInMRI: {$gt: 0}},
                {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]}
            ]
        }
    },
    { $unwind: "$geoLocationCountryCode" },
    {
        $group: {
            _id: "nz",
            count: { $sum: 1 },
            domain: { $addToSet: '$domain' },
            numPagesInMRICount: { $sum: '$numPagesInMRI' },
            numPagesContainingMRICount: { $sum: '$numPagesContainingMRI' }
        }
    },
    { $sort : { count : -1} }
]);

For sites originating in NZ or with a .nz TLD, none of the URLs were manually inspected and all URLs were accepted.

For all countries but NZ, get the final column results with:
    db.getCollection('Websites').find({domain:/coggle\.it/})
The URLs can then be checked with:
    db.getCollection('Webpages').find({URL: /coggle\.it/, isMRI: true})


NOTES:
1. DE:

"de","2.0","0+1","9+35 misdetected", http://www.cartogiraffe.com, https://www.cartogiraffe.com,
Ought to be 2+2 numPagesInMRICount and 9+2 numPagesContainingMRICount:
- Both cartogiraffe.com pages were identical and consisted mostly of MRI sentences, with one name not being MRI. So isMRI should have been true for both pages.
- Only one of the 2 MRI translations of the Universal Declaration of Human Rights at http://www.udhr.de got downloaded. A total of 75 pages were downloaded, but more translated pages appeared to be on the website. Not sure why the crawl had a _SUCCESS file to indicate a completed download.
- http://www.udhr.de also had 35-1 non-MRI language translations of the Universal Declaration of Human Rights in which one or more sentences were misdetected as MRI. With the additional MRI page that didn't get downloaded, there should be 9+2 = 11 pages containing MRI.

So instead of
"de","2.0","1","44", http://www.cartogiraffe.com, https://www.cartogiraffe.com, http://www.udhr.de
it should be
"de","2.0","4","11", http://www.cartogiraffe.com, https://www.cartogiraffe.com, http://www.udhr.de


"au","3.0",7+0+1,83+1+3,https://www.kiwiproperty.com, https://infogram.com/te-marautanga-o-aotearoa-moe-pld-allocations-2012-1go502ygvn562jd,https://koreromaori.com

2. US:
aclhokiangarocks.blogspot.com contains at least one page with MRI paragraphs. See http://aclhokiangarocks.blogspot.com/feeds/posts/default under the section "Nga Tuhinga o tatou Tupuna".
Although this page was crawled by Nutch, the blog presents its contents in a complex way, so the text wasn't retrieved here. See also the dedicated page this text should have been in: http://aclhokiangarocks.blogspot.com/2012/05/nga-tuhinga-o-tatou-tupuna.html

"_id","siteCount","numPagesInMRICount","numPagesContainingMRICount","URLs of pages detected as inMRI"
"nz","176.0" containsMRI sites vs 96 isMRI sites,"4360","9641" in the 176 containsMRI sites vs 7968 in the 96 isMRI sites
"us","29.0",
    1+2+0+0+4+166+0+39 +257+2+21+12+25+13+53+0+1+0+1+11 +32+37+4 +0+0+0 = 681,
    31+2+2+20+58+166+3+91 +258+2+25+12+66+22+53+6+1+1+2+10 +58+54+6 +1+2+1 = 953,
    anglicanhistory.org,unicode.org,static-promote.weebly.com,aclhokiangarocks.blogspot.com,bahaiprayers.net,biblehub.com,muhammad.com,godrules.net,m.biblepub.com, krassotkin.ru,gotquestions.org,
    maorinews.com,maaori.com,kiaorahola.blogspot.com,kjohnsonnz.blogspot.com,pumanawawhangara.blogspot.com,dannykahei.tripod.com,burkekm001.tripod.com,tkkpipipaopao.blogspot.com, manateina.blogspot.com,
    tatai09.blogspot.com,twttoa.com,tuhua2010.blogspot.com,
    breaker.audio,drive.google.com/file/d/1NwuzafjddaP8gxI7O_Zapts5bM7mrtwn/preview,in.pinterest.com/pin/317363104978423418/
"au","2.0","8","86", https://www.kiwiproperty.com, https://koreromaori.com
"de","2.0","4","11", http://www.cartogiraffe.com, https://www.cartogiraffe.com, http://www.udhr.de
"dk","2.0","4","7", *.ngapuhitelevision.com, *.ngapuhiradio.com
"bg","1.0","2","2", http://anitra.net/activism/humanrights/UDHR/mbf_print.htm, http://anitra.net/activism/humanrights/UDHR/rrt_print.htm
"cz","1.0","0","1", http://www.henryklahola.nazory.cz/094.Maori.htm, http://henryklahola.nazory.cz/094.Maori.htm
"es","1.0","1","1", https://www.uv.es/~pla/red.net/intmaori.html
"fr","1.0","1","1", http://chantsdeluttes.free.fr/versionsinter/page%20maori.html
"ie","1.0","1","3", https://coggle.it/diagram/WSYB0mLA2QABD5BH/t/ko-au-ko-koe


--------------

 https://www.statisticshowto.datasciencecentral.com/probability-and-statistics/find-sample-size/#CI1
 https://stats.stackexchange.com/questions/207584/sample-size-choice-with-binary-outcome
 https://www.statisticshowto.datasciencecentral.com/z-alpha2-za2/

 N (NZ pages where isMRI comes out true) = 4360
 Solving for n, the sample size:
 confidence level = 90%
 m, margin of error = 5%

 From the "z alpha/2" table, for 90% confidence, we get a z alpha/2 value of 1.6449 (or 1.645).

 Then the sample size, n, we need is = 1.6449^2 * 4360 / ( 1.6449^2 + (4 * 4359) * 0.05^2) = 255 (rounded up)


 For N = 681,
 sample size n is = 1.6449^2 * 681 / ( 1.6449^2 + (4 * 680) * 0.05^2) = 194 (rounded up)


 sample size for NZ: 255 (90% confidence with 5% margin of error, including a finite population correction factor)
 sample size for US: 194

*/
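The sample-size arithmetic above can be reproduced with a short script. This is a minimal sketch, assuming the worst-case proportion p = 0.5 (which is where the factor of 4 in the denominator comes from). The function name `sampleSize` and the constant `Z_90` are hypothetical and not part of the original notes.

```javascript
// Sample size with a finite population correction, assuming p = 0.5:
//   n = z^2 * N / (z^2 + 4 * (N - 1) * m^2)
// N: population size, z: z alpha/2 value, m: margin of error.
function sampleSize(N, z, m) {
    const z2 = z * z;
    // Round up, as the notes do, so the sample is never too small.
    return Math.ceil((z2 * N) / (z2 + 4 * (N - 1) * m * m));
}

const Z_90 = 1.6449; // z alpha/2 for 90% confidence, as used in the notes

console.log(sampleSize(4360, Z_90, 0.05)); // NZ: N = 4360 isMRI pages
console.log(sampleSize(681, Z_90, 0.05));  // US: N = 681 isMRI pages
```

With the inputs from the notes, this should give 255 for NZ and 194 for US.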



"_id","siteCount containsMRI","numPagesInMRICount","numPagesContainingMRICount","URLs of pages detected as inMRI"
"nz","176.0","4360","9641"
"us","29.0","681","953"
"au","2.0","8","86"
"de","2.0","4","11"
"dk","2.0","4","7"
"bg","1.0","2","2"
"cz","1.0","0","1"
"es","1.0","1","1"
"fr","1.0","1","1"
"ie","1.0","1","3"

Total sites containing MRI: 216
[of which 96 are NZ sites with isMRI pages]
Total pages detected as being in MRI: 5062
Total pages detected as containing MRI sentences: 10706
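
The three totals above are column sums of the per-country table. A minimal sketch to re-derive them follows. The `rows` array is simply the table transcribed by hand, and the names `columnSum` and `totals` are illustrative only.

```javascript
// Per-country rows: [_id, siteCount, numPagesInMRICount, numPagesContainingMRICount]
const rows = [
    ["nz", 176, 4360, 9641],
    ["us", 29, 681, 953],
    ["au", 2, 8, 86],
    ["de", 2, 4, 11],
    ["dk", 2, 4, 7],
    ["bg", 1, 2, 2],
    ["cz", 1, 0, 1],
    ["es", 1, 1, 1],
    ["fr", 1, 1, 1],
    ["ie", 1, 1, 3],
];

// Sum one numeric column across all rows.
const columnSum = (col) => rows.reduce((acc, row) => acc + row[col], 0);

const totals = {
    sites: columnSum(1),
    pagesInMRI: columnSum(2),
    pagesContainingMRI: columnSum(3),
};
console.log(totals);
```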



NZ - sample 255 pages from:
/*
db.Websites.aggregate([
    {
        $match: {
            $and: [
                {numPagesContainingMRI: {$gt: 0}},
                {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]}
            ]
        }
    },
    { $unwind: "$geoLocationCountryCode" },
    {
        $group: {
            _id: "nz",
            count: { $sum: 1 },
            domain: { $addToSet: '$domain' },
            numPagesInMRICount: { $sum: '$numPagesInMRI' },
            numPagesContainingMRICount: { $sum: '$numPagesContainingMRI' }
        }
    },
    { $sort : { count : -1} }
]);


OR is this better (only numPagesInMRI):

db.Websites.aggregate([
    {
        $match: {
            $and: [
                {numPagesInMRI: {$gt: 0}},
                {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]}
            ]
        }
    },
    { $unwind: "$geoLocationCountryCode" },
    {
        $group: {
            _id: "nz",
            count: { $sum: 1 },
            domain: { $addToSet: '$domain' },
            numPagesInMRICount: { $sum: '$numPagesInMRI' },
            numPagesContainingMRICount: { $sum: '$numPagesContainingMRI' }
        }
    },
    { $sort : { count : -1} }
]);
*/

num NZ sites with > 0 isMRI pages = 96
Total numPagesInMRI in NZ sites = 4360
Total numPagesContainingMRI in NZ sites = 7968

Using the results, you get a list of the domains that matched: 171 nz domains, though it should be 176? -1

Copy each domain (up to 255 of them) and look at the first 1 or 2 pages (max) that match isMRI:

1. db.getCollection('Webpages').find({URL:/pukekohe\.directorybusiness\.co\.nz/, isMRI: true}) - check that it returns a positive number of pages in MRI, and check the first 1-2 pages to make sure they are indeed in MRI. Note down the ratio of MRI finds, e.g. 2/2.

2. Find the pages that are containsMRI but not isMRI and check whether they do indeed contain sentences in MRI. Note down the ratio for the first 2 pages.
db.getCollection('Webpages').find({URL:/maori\.livingheritage\.org\.nz/, isMRI: false, containsMRI: true})
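
The two per-domain checks above can be generated rather than typed by hand. The following is a sketch only: `escapeRegex` and `inspectionFilters` are hypothetical helper names, while the field names `URL`, `isMRI` and `containsMRI` are taken from the queries above. Escaping the domain keeps each dot from matching an arbitrary character.

```javascript
// Escape regex metacharacters so "." in a domain matches only a literal dot.
function escapeRegex(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build the two filter documents used for manual inspection of one domain.
function inspectionFilters(domain) {
    const urlPattern = new RegExp(escapeRegex(domain));
    return {
        // Step 1: pages detected as being in MRI overall.
        inMRI: { URL: urlPattern, isMRI: true },
        // Step 2: pages containing MRI sentences without being in MRI overall.
        containsOnly: { URL: urlPattern, isMRI: false, containsMRI: true },
    };
}

const filters = inspectionFilters("maori.livingheritage.org.nz");
console.log(filters.inMRI.URL.source);
// In the mongo shell, each filter could then be passed to
// db.getCollection('Webpages').find(...), for example with .limit(2)
// to look at only the first two pages.
```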



/* 1 */
{
    "_id" : "nz",
    "count" : 96.0,
    "domain" : [
        "http://www.teipukarea.maori.nz", 3/3 1/3
        "http://ngatipahauwera.co.nz", 2/2, 2/2
        "http://www.oag.govt.nz", 2/2 0/2
        "https://sexualviolence.victimsinfo.govt.nz", 3/3 0/3
        "http://tmoa.tki.org.nz", 3/3 3/3
        "http://www.tewhanake.maori.nz", 3/3 2/3
        "http://www.matarikifestival.org.nz", 4/4 0/3
        "http://www.otepoti.school.nz", 3/3 0/4
!!      "https://www.maoritelevision.com", 3/4, 0 [no containsMRI outside isMRI pages]
        "http://pukapuka.nz", 3/3 1/4 [lorem ipsum used on first 3 pages]
        "http://community.nzdl.org", 3/3 0/3 [containsMRI has detected Te Taka Keegan as MRI sentence]
!!      "http://kmpmusic.co.nz", 0-4/4? [but CD listing of some MRI song titles] 0 [no other pages containsMRI]
        "http://maori.livingheritage.org.nz", 2/2 2/2
        "http://pukoro.co.nz", 2/2 0/2
        "https://register.tpota.org.nz", 0/1 [form] 0/2
X       "https://cdn.tehiku.nz" => DOMAIN: "tehiku.nz", 0/4, 1/3 [but audio content may be in MRI]
!!      "http://www.runanga.co.nz", 3/3 0 [no containsMRI outside isMRI pages]
!       "http://kuraaiwi.maori.nz", 2/4 [navigation only downloaded. But site content checked] 2/3
        "http://kurataiao.tki.org.nz", 3/3, 1/total 3

!!      "http://satellites.co.nz", 3/3 [kpop], 0 [no containsMRI outside isMRI pages]
        "http://teaohou.natlib.govt.nz", 4/4, 2/4
        "http://www.tuwharetoa.iwi.nz", 2/3 0/3
X       "http://auturoa.nz", 0/4 0/3 [lots of MRI terms among English] - COMMUNITY
        "https://www.terito.school.nz", 3/3, 0/2 total
        "https://ttw1.cwp.govt.nz", 3/3 3/3
        "https://www.whanau-tahi.school.nz", 4/4, 1/2 total
        "https://e-ako-pangarau.nzmaths.co.nz", 3/3 total, 1/1 total
        "https://teaomaori.news", 3/3, 0/1 total
        "http://tetaurawhiri.govt.nz", 3/3 /3/3 [Māori Language Commission site]
        "https://www.tuiatematangi.ac.nz", 4/4 3/3
        "http://animations.tewhanake.maori.nz", 3/3 3/3
!!      "https://www.dnc.org.nz", 1/1 total, 0 [no containsMRI outside isMRI pages]
!!      "http://firstworldwar.tki.org.nz", 3/3, 0 [no containsMRI outside isMRI pages]
        "http://www.28maoribattalion.org.nz", 3/3, 1/3
        "http://www.tewikiotereomaori.co.nz", 1/1 total, 3/3
        "http://www.brettgraham.co.nz", 1/1 total, 0/3
!!      "https://hepatakakupu.nz", 3/3, 0 [no containsMRI outside isMRI pages]

        "http://anglicanprayerbook.nz", 3/3 3/3
        "http://arataua.nz", 4/4, 2/3
        "http://blog.teara.govt.nz", 3/3, 0/3 [AS: teara.govt.nz]
        "http://maori.tki.org.nz", 3/3 3/3
DONE (with/out www): "http://www.firstworldwar.tki.org.nz",
X       "http://www.topomap.co.nz", 0/2 [all placenames], 0 [no containsMRI outside isMRI pages]
        "https://paekupu.co.nz", 4/4, 0 [no containsMRI outside isMRI pages]
        "https://haereheikaiako.co.nz", 1/1, 0 [no containsMRI outside isMRI pages]
        "https://curriculumtool.education.govt.nz", 4/4, 3/3
        "http://kurakokiri.maori.nz", 3/3, 3/3 [same nav menus on each page]
        "http://kete.wcl.govt.nz", 2/5 [first 3 misdetected: Tokelauan (American Samoa), Kiribati, Tongan], 0/3
        "http://www.kkmmaungarongo.co.nz", 3/3, 3/3
        "http://www.heartland.co.nz", 3/3, 1/1 total
        "http://oilcrash.com", 2/2 total, 0/3
        "http://www.kura-porirua.school.nz", 4/4, 2/3
        "http://videos.e-agent.nz", [AT: e-agent.nz] 3/3, 3/3 [repeated nav]
        "https://www.sporty.co.nz", 3/3, 0 [no containsMRI outside isMRI pages]
        "https://www.tematawai.maori.nz", 3/3, 3/3

        "https://www.terakipaewhenua.school.nz",
        "http://www.tetaurawhiri.govt.nz",
        "http://archive.stats.govt.nz",
        "http://tiritiowaitangi.govt.nz",
        "http://www.waiata.maori.nz",
        "http://hana.co.nz",
        "http://kaupare.co.nz",
        "http://www.tereowrap.nz",
        "https://www.e-agent.nz",
        "http://www.hrc.co.nz",
        "http://ngatiporoukiponeke.org.nz",
        "http://rurued.school.nz",
        "http://www.twtop.school.nz",
        "https://www.infinite-electronic.nz",
        "http://www.huri-translations.pf",
        "https://admin.teara.govt.nz",
        "https://tiritiowaitangi.govt.nz",
        "http://www.tmoa.tki.org.nz",
        "https://www.komako.org.nz",
        "http://www.wcl.govt.nz",
        "https://office.e-agent.nz",
        "http://punareo.co.nz",
        "http://www.kurakokiri.maori.nz",
        "https://rapuatearatika.education.govt.nz",
        "http://tmmkkm.school.nz",
        "https://www.components-mart.nz",
        "http://www.cs.waikato.ac.nz",
        "http://www.kupengahao.co.nz",
        "https://www.hapuhauora.health.nz",
        "https://www.lcds-display.nz",
        "http://waiata.maori.nz",
        "http://cms.sunsmartschools.co.nz",
        "http://www.livingheritage.org.nz",
        "http://kuraproductions.co.nz",
        "https://keepourmoneyclean.govt.nz",
        "http://www.tekura.school.nz",
        "http://www.tkkmmokopuna.school.nz",
        "http://hangaraumatihiko.tki.org.nz",
        "http://www.pakanae.maori.nz"
    ],
    "numPagesInMRICount" : 4360,
    "numPagesContainingMRICount" : 7968
}

----------------------------

/* 1 */
{
    "_id" : "nz",
    "count" : 176.0,
    "domain" : [
!!      "http://pukekohe.directorybusiness.co.nz", 0/2, 0/2, isMRI = 0!!
        "http://maori.livingheritage.org.nz", 2/2 2/2
        "http://pukoro.co.nz", 2/2 0/2
        "http://www.rakaumanga.school.nz", 0/4 0/4
        "http://www.ngamanawainc.co.nz", 0/2 0/2
        "https://office.e-agent.nz",
        "https://www.components-mart.nz",
        "http://tmmkkm.school.nz",
        "http://www.rotoruanz.com",
        "http://www.huri-translations.pf",
        "https://admin.teara.govt.nz",
        "http://hangaraumatihiko.tki.org.nz",
        "https://sexualviolence.victimsinfo.govt.nz",
        "http://www.tekura.school.nz",
        "http://philipbeadle.co.nz",
        "http://www.cs.waikato.ac.nz",
        "https://www.hapuhauora.health.nz",
        "http://cms.sunsmartschools.co.nz",
        "https://keepourmoneyclean.govt.nz",
        "http://www.kura-porirua.school.nz",
        "http://waitarahistory.org.nz",
        "http://oilcrash.com",
        "http://videos.e-agent.nz",
        "https://manawatuheritage.pncc.govt.nz",
        "https://www.terakipaewhenua.school.nz",
        "http://dev.nzpcn.org.nz",
        "https://kotahimiriona.co.nz",
        "http://kurakokiri.maori.nz",
        "https://www.sporty.co.nz",
        "http://kaupare.co.nz",
        "http://ngatiporoukiponeke.org.nz",
        "https://www.takitimu.ac.nz",
        "http://www.tetaurawhiri.govt.nz",
        "http://www.waiata.maori.nz",
        "http://conference.tpwt.maori.nz",
        "http://ngatiwhakaue.iwi.nz",
        "http://www.nzpcn.org.nz",
        "http://www.ruralfind.co.nz",
        "https://www.dnc.org.nz",
        "https://www.puau.school.nz",
        "https://kaiiwicamp.nz",
        "https://www.terito.school.nz",
        "https://www.pinterest.nz",
        "https://e-ako-pangarau.nzmaths.co.nz",
        "http://givealittle.co.nz",
        "https://teaomaori.news",
        "https://www.korokikahukura.co.nz",
        "http://myfathersworld.net.nz",
        "http://www.firstworldwar.tki.org.nz",
        "https://www.ashtangatauranga.co.nz",
        "http://biketorqueyamaha.co.nz",
        "https://www.rereahu.maori.nz",
        "http://www.tewikiotereomaori.co.nz",
        "http://www.brettgraham.co.nz",
        "http://tewikiotereomaori.nz",
        "http://anglicanprayerbook.nz",
        "http://arataua.nz",
        "http://blog.teara.govt.nz",
        "http://www.otepoti.school.nz",
        "http://www.kmk.maori.nz",
        "http://www.eventcinemas.co.nz",
        "https://www.stats.govt.nz",
        "http://www.oag.govt.nz", 2/2 0/2
        "http://whatonga.school.nz",
        "http://www.tewhanake.maori.nz",
        "https://www.maoritelevision.com",
        "http://kuraaiwi.maori.nz",
        "http://kurataiao.tki.org.nz",
        "http://teaohou.natlib.govt.nz",
        "http://www.tetaumuturunanga.iwi.nz",
        "http://www.tasteofplenty.co.nz",
        "http://community.nzdl.org",
        "https://www.blushandbrows.nz",
        "https://register.tpota.org.nz",
        "https://cdn.tehiku.nz",
        "http://www.wcl.govt.nz",
        "http://www.jeremybaker.nz",
        "http://punareo.co.nz",
        "https://rapuatearatika.education.govt.nz",
        "http://www.kurakokiri.maori.nz",
        "https://www.cruisetourstauranga.co.nz",
        "https://sooty.nz",
        "http://rakaumanga.school.nz",
        "https://tiritiowaitangi.govt.nz",
        "http://www.tmoa.tki.org.nz",
        "http://www.w3vietnam.org.nz",
        "https://www.infinite-electronic.nz",
        "https://www.komako.org.nz",
        "http://nzpostcard.co.nz",
        "http://artizani.co.nz",
        "http://www.finlaysonpark.school.nz",
        "http://crimson.co.nz",
        "http://holyspirit.nz",
        "http://www.tkkmmokopuna.school.nz",
        "http://www.pakanae.maori.nz",
        "http://www.teipukarea.maori.nz",
        "http://archerpix.com",
        "https://2019.nethui.nz",
        "http://www.kupengahao.co.nz",
        "https://www.lcds-display.nz",
        "http://waiata.maori.nz",
        "http://kuraproductions.co.nz",
        "http://www.biketorqueyamaha.co.nz",
        "http://www.livingheritage.org.nz",
        "http://www.zoomin.co.nz",
        "http://rsnz.natlib.govt.nz",
        "http://otorohanga.directorybusiness.co.nz",
        "http://reoora.co.nz",
        "http://w3vietnam.org.nz",
        "https://rehuamarae.co.nz",
        "https://www.electionresults.org.nz",
        "https://www.ngamanawainc.co.nz",
        "https://www.rotorua-rafting.co.nz",
        "https://www.taitokerautrust.org.nz",
        "https://www.wingspan.co.nz",
        "http://www.kkmmaungarongo.co.nz",
        "http://kete.wcl.govt.nz",
        "http://www.heartland.co.nz",
        "http://www.electionresults.govt.nz",
        "https://www.tematawai.maori.nz",
        "http://hana.co.nz",
        "http://www.tereowrap.nz",
        "http://rurued.school.nz",
        "http://www.twtop.school.nz",
        "http://rexedra.gen.nz",
        "http://archive.stats.govt.nz",
        "https://liveresults.co.nz",
        "https://www.e-agent.nz",
        "http://tiritiowaitangi.govt.nz",
        "http://www.hrc.co.nz",
        "http://animations.tewhanake.maori.nz",
        "https://interactives.stuff.co.nz",
        "http://avonside.net",
        "http://www.methodist.org.nz",
        "https://www.tasteofplenty.co.nz",
        "http://www.maoriinvestments.co.nz",
        "https://m.wairarapatv.co.nz",
        "http://www.gans.co.nz",
        "https://ttw1.cwp.govt.nz",
        "http://ngarauhuia.ngatiapakiterato.iwi.nz",
        "https://www.tuiatematangi.ac.nz",
        "http://tetaurawhiri.govt.nz",
        "http://maori.tki.org.nz",
        "http://www.topomap.co.nz",
        "https://www.puhaandpakeha.co.nz",
        "https://haereheikaiako.co.nz",
        "https://paekupu.co.nz",
        "https://curriculumtool.education.govt.nz",
        "http://firstworldwar.tki.org.nz",
        "http://www.28maoribattalion.org.nz",
        "https://hepatakakupu.nz",
        "https://www.zenbu.co.nz",
        "http://www.matarikifestival.org.nz",
        "http://pukapuka.nz",
        "http://ngatipahauwera.co.nz", 2/2 2/2
        "http://southerntribes.co.nz",
        "https://player.vimeo.com",
        "http://tmoa.tki.org.nz",
        "http://www.writersfestival.co.nz",
        "http://talkingtothecan.com",
        "https://www.whanau-tahi.school.nz",
        "http://satellites.co.nz",
        "http://auturoa.nz",
        "http://www.tuwharetoa.iwi.nz",
        "http://kmpmusic.co.nz",
        "http://www.temarareo.org",
        "http://archive.electionresults.govt.nz",
        "http://kaiiwicamp.nz",
        "http://tehauora.org.nz",
        "http://temahurehure.maori.nz",
        "http://www.runanga.co.nz"
    ],
    "numPagesInMRICount" : 4360,
    "numPagesContainingMRICount" : 9641
}