root/other-projects/maori-lang-detection/mongodb-data/6table_nonProductSites1_manualShortlist.json @ 33854

Revision 33854, 19.1 KB (checked in by ak19, 2 months ago)

Manually gone over around 150 webpages of sample size of 255 webpages from NZ checking whether those for which isMRI=true was detected is indeed the case. Also have been sampling an almost equal number of NZ webpages for which isMRI=false yet containsMRI=true.

Line 
1/*
2For sites originating in NZ or with nz TLD, none of the URLs are manually inspected and all URLs are accepted.
3
4For all but NZ, get final column results with:
5    db.getCollection('Websites').find({domain:/coggle\.it/})
6And can check for URLs with:
7    db.getCollection('Webpages').find({URL: /coggle\.it/, isMRI: true})
8
9
10NOTES:
111. DE:
12
13"de","2.0","0+1","9+35 misdetected", http://www.cartogiraffe.com, https://www.cartogiraffe.com,
14Ought to be 2+2 numPagesInMRICount and 9+2 numPagesContainingMRICount:
15- both cartogiraffe.com pages were identical and had mostly MRI sentences with one name not being MRI. So isMRI should have been true for both pages.
16- Only one of the 2 MRI translations of the universal declaration of human rights at http://www.udhr.de got downloaded. A total of 75 pages were downloaded, but more translated pages appeared to be on the webpage. Not sure why the crawl had a _SUCCESS file to indicate completed download.
17- Then http://www.udhr.de had 35-1 non-MRI language translations of the universal declaration of human rights where one or more sentences were misdetected as MRI. With the additional MRI page that didn't get downloaded, should have 9+2 = 11 pages containing MRI.
18
19So instead of
20"de","2.0","1","44", http://www.cartogiraffe.com, https://www.cartogiraffe.com, http://www.udhr.de
21"de","2.0","4","11", http://www.cartogiraffe.com, https://www.cartogiraffe.com, http://www.udhr.de
22
23
24"au","3.0",7+0+1,83+1+3,https://www.kiwiproperty.com, https://infogram.com/te-marautanga-o-aotearoa-moe-pld-allocations-2012-1go502ygvn562jd,https://koreromaori.com
25
262. US:
27aclhokiangarocks.blogspot.com contains at least a page with MRI paragraphs. See http://aclhokiangarocks.blogspot.com/feeds/posts/default under section "Nga Tuhinga o tatou Tupuna"
28Although this page has been crawled by Nutch, the contents were presented in the blog in a complex way and therefore the text wasn't retrieved here. See also the dedicated page this text should have been in http://aclhokiangarocks.blogspot.com/2012/05/nga-tuhinga-o-tatou-tupuna.html
29
30"_id","siteCount","numPagesInMRICount","numPagesContainingMRICount","URLs of pages detected as inMRI"
31"nz","176.0" containsMRI vs 96 pages inMRI,"4360","9641" in 176 containsMRI pages vs 7968 in isMRI pages
32"us","29.0",
33    1+2+0+0+4+166+0+39 +257+2+21+12+25+13+53+0+1+0+1+11 +32+37+4 +0+0+0 = 681,
34    31+2+2+20+58+166+3+91 +258+2+25+12+66+22+53+6+1+1+2+10 +58+54+6 +1+2+1 = 953,
35    anglicanhistory.org,unicode.org,static-promote.weebly.com,aclhokiangarocks.blogspot.com,bahaiprayers.net,biblehub.com,muhammad.com,godrules.net,m.biblepub.com, krassotkin.ru,gotquestions.org,
36      maorinews.com,maaori.com,kiaorahola.blogspot.com,kjohnsonnz.blogspot.com,pumanawawhangara.blogspot.com,dannykahei.tripod.com,burkekm001.tripod.com,tkkpipipaopao.blogspot.com, manateina.blogspot.com,
37    tatai09.blogspot.com,twttoa.com,tuhua2010.blogspot.com,
38    breaker.audio,drive.google.com/file/d/1NwuzafjddaP8gxI7O_Zapts5bM7mrtwn/preview,in.pinterest.com/pin/317363104978423418/
39"au","2.0","8","86", https://www.kiwiproperty.com, https://koreromaori.com
40"de","2.0","4","11", http://www.cartogiraffe.com, https://www.cartogiraffe.com, http://www.udhr.de
41"dk","2.0","4","7", *.ngapuhitelevision.com, *.ngapuhiradio.com
42"bg","1.0","2","2", http://anitra.net/activism/humanrights/UDHR/mbf_print.htm, http://anitra.net/activism/humanrights/UDHR/rrt_print.htm
43"cz","1.0","0","1", http://www.henryklahola.nazory.cz/094.Maori.htm, http://henryklahola.nazory.cz/094.Maori.htm
44"es","1.0","1","1", https://www.uv.es/~pla/red.net/intmaori.html
45"fr","1.0","1","1", http://chantsdeluttes.free.fr/versionsinter/page%20maori.html
46"ie","1.0","1","3", https://coggle.it/diagram/WSYB0mLA2QABD5BH/t/ko-au-ko-koe
47
48
49
50
51
52--------------
53
54https://www.statisticshowto.datasciencecentral.com/probability-and-statistics/find-sample-size/#CI1
55https://stats.stackexchange.com/questions/207584/sample-size-choice-with-binary-outcome
56https://www.statisticshowto.datasciencecentral.com/z-alpha2-za2/
57
58N (NZ pages where isMRI comes out true) = 4360
59solving for n, the sample size
60confidence level = 90%
61m, margin of error = 5%
62
63From the "z alpha/2" table, for 90% confidence, we get a z alpha/2 value of 1.6449 (or 1.645).
64
65Then the sample size, n, we need is = 1.6449^2 * 4360 / ( 1.6449^2 + (4 * 4359) * 0.05^2) = 255 (rounded up)
66
67
68For N = 681,
69sample size n is = 1.6449^2 * 681 / ( 1.6449^2 + (4 * 680) * 0.05^2) = 194 (rounded up)
70
71
72sample size for NZ: 255 (90% confidence with 5% margine of error, Including a finite correction factor)
73sample size for US: 194
74
75*/
76
77
78
79"_id","siteCount","numPagesInMRICount","numPagesContainingMRICount","URLs of pages detected as inMRI"
80"nz","176.0","4360","9641"
81"us","29.0","681","953"
82"au","2.0","8","86"
83"de","2.0","4","11"
84"dk","2.0","4","7"
85"bg","1.0","2","2"
86"cz","1.0","0","1"
87"es","1.0","1","1"
88"fr","1.0","1","1"
89"ie","1.0","1","3"
90
91Total sites containing MRI: 216
92Total pages detected as being in MRI: 5062
93Total pages detected as containing MRI sentences: 10706
94
95
96
97NZ - sample 255 pages from:
98/*
99db.Websites.aggregate([
100    {
101        $match: {
102            $and: [
103                {numPagesContainingMRI: {$gt: 0}},
104                {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]}
105            ]
106        }
107    },
108    { $unwind: "$geoLocationCountryCode" },
109    {
110        $group: {
111            _id: "nz",
112            count: { $sum: 1 },
113            domain: { $addToSet: '$domain' },
114            numPagesInMRICount: { $sum: '$numPagesInMRI' },
115            numPagesContainingMRICount: { $sum: '$numPagesContainingMRI' }
116        }
117    },
118    { $sort : { count : -1} }
119]);
120
121
122OR is this better:
123
124db.Websites.aggregate([
125    {
126        $match: {
127            $and: [
128                {numPagesInMRI: {$gt: 0}},
129                {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]}
130            ]
131        }
132    },
133    { $unwind: "$geoLocationCountryCode" },
134    {
135        $group: {
136            _id: "nz",
137            count: { $sum: 1 },
138            domain: { $addToSet: '$domain' },
139            numPagesInMRICount: { $sum: '$numPagesInMRI' },
140            numPagesContainingMRICount: { $sum: '$numPagesContainingMRI' }
141        }
142    },
143    { $sort : { count : -1} }
144]);
145*/
146
147num NZ sites with > 0 isMRI pages = 96
148Total numPagesInMRI in NZ sites = 4360
149Total numPagesContainingMRI in NZ sites = 7968
150
151Using the results you get a list of domains that matched. 171 nz domains, though it should be 176? -1
152
153Copy each domain (up to 255 of them) and look for the first 1 or 2 max that matches isMRI:
154
1551. db.getCollection('Webpages').find({URL:/pukekohe.directorybusiness.co.nz/, isMRI: true}) - check it contains a positive number of pages in MRI and check the first 1-2 pages to make sure they are indeed in MRI. Note down the ratio of MRI finds. e.g. 2/2.
156
1572. Find those pages that containsMRI but not isMRI and check if there are indeed sentences in MRI. Note down the ratio for the first 2 pages.
158db.getCollection('Webpages').find({URL:/maori.livingheritage.org.nz/, isMRI: false, containsMRI: true})
159
160
161
162/* 1 */
163{
164    "_id" : "nz",
165    "count" : 96.0,
166    "domain" : [
167        "http://www.teipukarea.maori.nz", 3/3 1/3
168        "http://ngatipahauwera.co.nz", 2/2, 2/2
169        "http://www.oag.govt.nz", 2/2 0/2
170        "https://sexualviolence.victimsinfo.govt.nz", 3/3 0/3
171        "http://tmoa.tki.org.nz", 3/3 3/3
172        "http://www.tewhanake.maori.nz", 3/3 2/3
173        "http://www.matarikifestival.org.nz", 4/4 0/3
174        "http://www.otepoti.school.nz", 3/3 0/4
175!!        "https://www.maoritelevision.com", 3/4, 0 [no containsMRI outside isMRI pages]
176        "http://pukapuka.nz", 3/3 1/4 [lorem ipsum used on first 3 pages]
177        "http://community.nzdl.org", 3/3 0/3 [containsMRI has detected Te Taka Keegan as MRI sentence]
178!!        "http://kmpmusic.co.nz", 0-4/4? [but CD listing of some MRI song titles] 0 [no other pages containsMRI]
179        "http://maori.livingheritage.org.nz", 2/2 2/2
180        "http://pukoro.co.nz", 2/2 0/2
181        "https://register.tpota.org.nz", 0/1 [form] 0/2
182X        "https://cdn.tehiku.nz" => DOMAIN: "tehiku.nz",  0/4, 1/3 [but audio content may be in MRI]
183!!        "http://www.runanga.co.nz", 3/3 0 [no containsMRI outside isMRI pages]
184!        "http://kuraaiwi.maori.nz", 2/4 [navigation only downloaded. But site content checked] 2/3
185        "http://kurataiao.tki.org.nz", 3/3, 1/total 3
186
187!!        "http://satellites.co.nz", 3/3 [kpop], 0 [no containsMRI outside isMRI pages]
188        "http://teaohou.natlib.govt.nz", 4/4, 2/4
189        "http://www.tuwharetoa.iwi.nz", 2/3 0/3
190X        "http://auturoa.nz", 0/4 0/3 [lots of MRI terms among English] - COMMUNITY
191        "https://www.terito.school.nz", 3/3, 0/2 total
192        "https://ttw1.cwp.govt.nz", 3/3 3/3
193        "https://www.whanau-tahi.school.nz", 4/4, 1/2 total
194        "https://e-ako-pangarau.nzmaths.co.nz", 3/3 total, 1/1 total
195        "https://teaomaori.news", 3/3, 0/1 total
196        "http://tetaurawhiri.govt.nz", 3/3 /3/3 [Māori Language Commission site]
197        "https://www.tuiatematangi.ac.nz", 4/4 3/3
198        "http://animations.tewhanake.maori.nz", 3/3 3/3
199!!       "https://www.dnc.org.nz", 1/1 total, 0 [no containsMRI outside isMRI pages]
200!!        "http://firstworldwar.tki.org.nz", 3/3, 0 [no containsMRI outside isMRI pages]
201        "http://www.28maoribattalion.org.nz", 3/3, 1/3
202        "http://www.tewikiotereomaori.co.nz", 1/1 total, 3/3
203        "http://www.brettgraham.co.nz", 1/1 total, 0/3
204!!        "https://hepatakakupu.nz", 3/3, 0 [no containsMRI outside isMRI pages]
205
206        "http://anglicanprayerbook.nz", 3/3 3/3
207        "http://arataua.nz", 4/4, 2/3
208        "http://blog.teara.govt.nz", 3/3, 0/3 [AS: teara.govt.nz]
209        "http://maori.tki.org.nz", 3/3 3/3
210DONE (with/out www):        "http://www.firstworldwar.tki.org.nz",
211X        "http://www.topomap.co.nz", 0/2 [all placenames], 0 [no containsMRI outside isMRI pages]
212        "https://paekupu.co.nz", 4/4, 0 [no containsMRI outside isMRI pages]
213        "https://haereheikaiako.co.nz", 1/1, 0 [no containsMRI outside isMRI pages]
214        "https://curriculumtool.education.govt.nz", 4/4, 3/3
215        "http://kurakokiri.maori.nz", 3/3, 3/3 [same nav menus on each page]
216        "http://kete.wcl.govt.nz", 2/5 [first 3 misdetected: Tokelauan (American Samoa), Kiribati, Tongan], 0/3
217        "http://www.kkmmaungarongo.co.nz", 3/3, 3/3
218        "http://www.heartland.co.nz", 3/3, 1/1 total
219        "http://oilcrash.com", 2/2 total, 0/3
220        "http://www.kura-porirua.school.nz", 4/4, 2/3
221        "http://videos.e-agent.nz", [AT: e-agent.nz] 3/3, 3/3 [repeated nav]
222        "https://www.sporty.co.nz", 3/3, 0 [no containsMRI outside isMRI pages]
223        "https://www.tematawai.maori.nz", 3/3, 3/3
224
225        "https://www.terakipaewhenua.school.nz",
226        "http://www.tetaurawhiri.govt.nz",
227        "http://archive.stats.govt.nz",
228        "http://tiritiowaitangi.govt.nz",
229        "http://www.waiata.maori.nz",
230        "http://hana.co.nz",
231        "http://kaupare.co.nz",
232        "http://www.tereowrap.nz",
233        "https://www.e-agent.nz",
234        "http://www.hrc.co.nz",
235        "http://ngatiporoukiponeke.org.nz",
236        "http://rurued.school.nz",
237        "http://www.twtop.school.nz",
238        "https://www.infinite-electronic.nz",
239        "http://www.huri-translations.pf",
240        "https://admin.teara.govt.nz",
241        "https://tiritiowaitangi.govt.nz",
242        "http://www.tmoa.tki.org.nz",
243        "https://www.komako.org.nz",
244        "http://www.wcl.govt.nz",
245        "https://office.e-agent.nz",
246        "http://punareo.co.nz",
247        "http://www.kurakokiri.maori.nz",
248        "https://rapuatearatika.education.govt.nz",
249        "http://tmmkkm.school.nz",
250        "https://www.components-mart.nz",
251        "http://www.cs.waikato.ac.nz",
252        "http://www.kupengahao.co.nz",
253        "https://www.hapuhauora.health.nz",
254        "https://www.lcds-display.nz",
255        "http://waiata.maori.nz",
256        "http://cms.sunsmartschools.co.nz",
257        "http://www.livingheritage.org.nz",
258        "http://kuraproductions.co.nz",
259        "https://keepourmoneyclean.govt.nz",
260        "http://www.tekura.school.nz",
261        "http://www.tkkmmokopuna.school.nz",
262        "http://hangaraumatihiko.tki.org.nz",
263        "http://www.pakanae.maori.nz"
264    ],
265    "numPagesInMRICount" : 4360,
266    "numPagesContainingMRICount" : 7968
267}
268
269----------------------------
270
271/* 1 */
272{
273    "_id" : "nz",
274    "count" : 176.0,
275    "domain" : [
276!!        "http://pukekohe.directorybusiness.co.nz", 0/2, 0/2, isMRI = 0!!
277        "http://maori.livingheritage.org.nz", 2/2 2/2
278        "http://pukoro.co.nz", 2/2 0/2
279        "http://www.rakaumanga.school.nz", 0/4 0/4
280        "http://www.ngamanawainc.co.nz", 0/2 0/2
281        "https://office.e-agent.nz",
282        "https://www.components-mart.nz",
283        "http://tmmkkm.school.nz",
284        "http://www.rotoruanz.com",
285        "http://www.huri-translations.pf",
286        "https://admin.teara.govt.nz",
287        "http://hangaraumatihiko.tki.org.nz",
288        "https://sexualviolence.victimsinfo.govt.nz",
289        "http://www.tekura.school.nz",
290        "http://philipbeadle.co.nz",
291        "http://www.cs.waikato.ac.nz",
292        "https://www.hapuhauora.health.nz",
293        "http://cms.sunsmartschools.co.nz",
294        "https://keepourmoneyclean.govt.nz",
295        "http://www.kura-porirua.school.nz",
296        "http://waitarahistory.org.nz",
297        "http://oilcrash.com",
298        "http://videos.e-agent.nz",
299        "https://manawatuheritage.pncc.govt.nz",
300        "https://www.terakipaewhenua.school.nz",
301        "http://dev.nzpcn.org.nz",
302        "https://kotahimiriona.co.nz",
303        "http://kurakokiri.maori.nz",
304        "https://www.sporty.co.nz",
305        "http://kaupare.co.nz",
306        "http://ngatiporoukiponeke.org.nz",
307        "https://www.takitimu.ac.nz",
308        "http://www.tetaurawhiri.govt.nz",
309        "http://www.waiata.maori.nz",
310        "http://conference.tpwt.maori.nz",
311        "http://ngatiwhakaue.iwi.nz",
312        "http://www.nzpcn.org.nz",
313        "http://www.ruralfind.co.nz",
314        "https://www.dnc.org.nz",
315        "https://www.puau.school.nz",
316        "https://kaiiwicamp.nz",
317        "https://www.terito.school.nz",
318        "https://www.pinterest.nz",
319        "https://e-ako-pangarau.nzmaths.co.nz",
320        "http://givealittle.co.nz",
321        "https://teaomaori.news",
322        "https://www.korokikahukura.co.nz",
323        "http://myfathersworld.net.nz",
324        "http://www.firstworldwar.tki.org.nz",
325        "https://www.ashtangatauranga.co.nz",
326        "http://biketorqueyamaha.co.nz",
327        "https://www.rereahu.maori.nz",
328        "http://www.tewikiotereomaori.co.nz",
329        "http://www.brettgraham.co.nz",
330        "http://tewikiotereomaori.nz",
331        "http://anglicanprayerbook.nz",
332        "http://arataua.nz",
333        "http://blog.teara.govt.nz",
334        "http://www.otepoti.school.nz",
335        "http://www.kmk.maori.nz",
336        "http://www.eventcinemas.co.nz",
337        "https://www.stats.govt.nz",
338        "http://www.oag.govt.nz", 2/2 0/2
339        "http://whatonga.school.nz",
340        "http://www.tewhanake.maori.nz",
341        "https://www.maoritelevision.com",
342        "http://kuraaiwi.maori.nz",
343        "http://kurataiao.tki.org.nz",
344        "http://teaohou.natlib.govt.nz",
345        "http://www.tetaumuturunanga.iwi.nz",
346        "http://www.tasteofplenty.co.nz",
347        "http://community.nzdl.org",
348        "https://www.blushandbrows.nz",
349        "https://register.tpota.org.nz",
350        "https://cdn.tehiku.nz",
351        "http://www.wcl.govt.nz",
352        "http://www.jeremybaker.nz",
353        "http://punareo.co.nz",
354        "https://rapuatearatika.education.govt.nz",
355        "http://www.kurakokiri.maori.nz",
356        "https://www.cruisetourstauranga.co.nz",
357        "https://sooty.nz",
358        "http://rakaumanga.school.nz",
359        "https://tiritiowaitangi.govt.nz",
360        "http://www.tmoa.tki.org.nz",
361        "http://www.w3vietnam.org.nz",
362        "https://www.infinite-electronic.nz",
363        "https://www.komako.org.nz",
364        "http://nzpostcard.co.nz",
365        "http://artizani.co.nz",
366        "http://www.finlaysonpark.school.nz",
367        "http://crimson.co.nz",
368        "http://holyspirit.nz",
369        "http://www.tkkmmokopuna.school.nz",
370        "http://www.pakanae.maori.nz",
371        "http://www.teipukarea.maori.nz",
372        "http://archerpix.com",
373        "https://2019.nethui.nz",
374        "http://www.kupengahao.co.nz",
375        "https://www.lcds-display.nz",
376        "http://waiata.maori.nz",
377        "http://kuraproductions.co.nz",
378        "http://www.biketorqueyamaha.co.nz",
379        "http://www.livingheritage.org.nz",
380        "http://www.zoomin.co.nz",
381        "http://rsnz.natlib.govt.nz",
382        "http://otorohanga.directorybusiness.co.nz",
383        "http://reoora.co.nz",
384        "http://w3vietnam.org.nz",
385        "https://rehuamarae.co.nz",
386        "https://www.electionresults.org.nz",
387        "https://www.ngamanawainc.co.nz",
388        "https://www.rotorua-rafting.co.nz",
389        "https://www.taitokerautrust.org.nz",
390        "https://www.wingspan.co.nz",
391        "http://www.kkmmaungarongo.co.nz",
392        "http://kete.wcl.govt.nz",
393        "http://www.heartland.co.nz",
394        "http://www.electionresults.govt.nz",
395        "https://www.tematawai.maori.nz",
396        "http://hana.co.nz",
397        "http://www.tereowrap.nz",
398        "http://rurued.school.nz",
399        "http://www.twtop.school.nz",
400        "http://rexedra.gen.nz",
401        "http://archive.stats.govt.nz",
402        "https://liveresults.co.nz",
403        "https://www.e-agent.nz",
404        "http://tiritiowaitangi.govt.nz",
405        "http://www.hrc.co.nz",
406        "http://animations.tewhanake.maori.nz",
407        "https://interactives.stuff.co.nz",
408        "http://avonside.net",
409        "http://www.methodist.org.nz",
410        "https://www.tasteofplenty.co.nz",
411        "http://www.maoriinvestments.co.nz",
412        "https://m.wairarapatv.co.nz",
413        "http://www.gans.co.nz",
414        "https://ttw1.cwp.govt.nz",
415        "http://ngarauhuia.ngatiapakiterato.iwi.nz",
416        "https://www.tuiatematangi.ac.nz",
417        "http://tetaurawhiri.govt.nz",
418        "http://maori.tki.org.nz",
419        "http://www.topomap.co.nz",
420        "https://www.puhaandpakeha.co.nz",
421        "https://haereheikaiako.co.nz",
422        "https://paekupu.co.nz",
423        "https://curriculumtool.education.govt.nz",
424        "http://firstworldwar.tki.org.nz",
425        "http://www.28maoribattalion.org.nz",
426        "https://hepatakakupu.nz",
427        "https://www.zenbu.co.nz",
428        "http://www.matarikifestival.org.nz",
429        "http://pukapuka.nz",
430        "http://ngatipahauwera.co.nz", 2/2 2/2
431        "http://southerntribes.co.nz",
432        "https://player.vimeo.com",
433        "http://tmoa.tki.org.nz",
434        "http://www.writersfestival.co.nz",
435        "http://talkingtothecan.com",
436        "https://www.whanau-tahi.school.nz",
437        "http://satellites.co.nz",
438        "http://auturoa.nz",
439        "http://www.tuwharetoa.iwi.nz",
440        "http://kmpmusic.co.nz",
441        "http://www.temarareo.org",
442        "http://archive.electionresults.govt.nz",
443        "http://kaiiwicamp.nz",
444        "http://tehauora.org.nz",
445        "http://temahurehure.maori.nz",
446        "http://www.runanga.co.nz"
447    ],
448    "numPagesInMRICount" : 4360,
449    "numPagesContainingMRICount" : 9641
450}
451
452
Note: See TracBrowser for help on using the browser.