root/other-projects/maori-lang-detection/mongodb-data/6table_nonProductSites1_manualShortlist.json @ 33868

Revision 33868, 19.1 KB (checked in by ak19, 2 months ago)

With the updated code for generating the maps from 6a and 6b manual site counts, generated corrected maps for num PAGES in MRI and num PAGES containing MRI and their geojson files. (Also some tabbing to 6table file).

1/*
2For sites originating in NZ or with nz TLD, none of the URLs are manually inspected and all URLs are accepted.
3
4For all but NZ, get final column results with:
5    db.getCollection('Websites').find({domain:/coggle\.it/})
6And can check for URLs with:
7    db.getCollection('Webpages').find({URL: /coggle\.it/, isMRI: true})
8
9
10NOTES:
111. DE:
12
13"de","2.0","0+1","9+35 misdetected", http://www.cartogiraffe.com, https://www.cartogiraffe.com,
14Ought to be 2+2 numPagesInMRICount and 9+2 numPagesContainingMRICount:
15- both cartogiraffe.com pages were identical and had mostly MRI sentences with one name not being MRI. So isMRI should have been true for both pages.
16- Only one of the 2 MRI translations of the universal declaration of human rights at http://www.udhr.de got downloaded. A total of 75 pages were downloaded, but more translated pages appeared to be on the webpage. Not sure why the crawl had a _SUCCESS file to indicate completed download.
17- Then http://www.udhr.de had 35-1 non-MRI language translations of the universal declaration of human rights where one or more sentences were misdetected as MRI. With the additional MRI page that didn't get downloaded, should have 9+2 = 11 pages containing MRI.
18
19So instead of the first row below, the second (corrected) row should be used:
20"de","2.0","1","44", http://www.cartogiraffe.com, https://www.cartogiraffe.com, http://www.udhr.de
21"de","2.0","4","11", http://www.cartogiraffe.com, https://www.cartogiraffe.com, http://www.udhr.de
22
23
24"au","3.0",7+0+1,83+1+3,https://www.kiwiproperty.com, https://infogram.com/te-marautanga-o-aotearoa-moe-pld-allocations-2012-1go502ygvn562jd,https://koreromaori.com
25
262. US:
27aclhokiangarocks.blogspot.com contains at least one page with MRI paragraphs. See http://aclhokiangarocks.blogspot.com/feeds/posts/default under the section "Nga Tuhinga o tatou Tupuna"
28Although this page was crawled by Nutch, the blog presents its contents in a complex way, so the text wasn't retrieved here. See also the dedicated page this text should have been on: http://aclhokiangarocks.blogspot.com/2012/05/nga-tuhinga-o-tatou-tupuna.html
29
30"_id","siteCount","numPagesInMRICount","numPagesContainingMRICount","URLs of pages detected as inMRI"
31"nz","176.0" containsMRI sites vs 96 isMRI sites,"4360","9641" across the 176 containsMRI sites vs 7968 across the 96 isMRI sites
32"us","29.0",
33    1+2+0+0+4+166+0+39 +257+2+21+12+25+13+53+0+1+0+1+11 +32+37+4 +0+0+0 = 681,
34    31+2+2+20+58+166+3+91 +258+2+25+12+66+22+53+6+1+1+2+10 +58+54+6 +1+2+1 = 953,
35    anglicanhistory.org,unicode.org,static-promote.weebly.com,aclhokiangarocks.blogspot.com,bahaiprayers.net,biblehub.com,muhammad.com,godrules.net,m.biblepub.com, krassotkin.ru,gotquestions.org,
36      maorinews.com,maaori.com,kiaorahola.blogspot.com,kjohnsonnz.blogspot.com,pumanawawhangara.blogspot.com,dannykahei.tripod.com,burkekm001.tripod.com,tkkpipipaopao.blogspot.com, manateina.blogspot.com,
37    tatai09.blogspot.com,twttoa.com,tuhua2010.blogspot.com,
38    breaker.audio,drive.google.com/file/d/1NwuzafjddaP8gxI7O_Zapts5bM7mrtwn/preview,in.pinterest.com/pin/317363104978423418/
39"au","2.0","8","86", https://www.kiwiproperty.com, https://koreromaori.com
40"de","2.0","4","11", http://www.cartogiraffe.com, https://www.cartogiraffe.com, http://www.udhr.de
41"dk","2.0","4","7", *.ngapuhitelevision.com, *.ngapuhiradio.com
42"bg","1.0","2","2", http://anitra.net/activism/humanrights/UDHR/mbf_print.htm, http://anitra.net/activism/humanrights/UDHR/rrt_print.htm
43"cz","1.0","0","1", http://www.henryklahola.nazory.cz/094.Maori.htm, http://henryklahola.nazory.cz/094.Maori.htm
44"es","1.0","1","1", https://www.uv.es/~pla/red.net/intmaori.html
45"fr","1.0","1","1", http://chantsdeluttes.free.fr/versionsinter/page%20maori.html
46"ie","1.0","1","3", https://coggle.it/diagram/WSYB0mLA2QABD5BH/t/ko-au-ko-koe
47
48
49--------------
50
51    https://www.statisticshowto.datasciencecentral.com/probability-and-statistics/find-sample-size/#CI1
52    https://stats.stackexchange.com/questions/207584/sample-size-choice-with-binary-outcome
53    https://www.statisticshowto.datasciencecentral.com/z-alpha2-za2/
54
55    N (NZ pages where isMRI comes out true) = 4360
56    solving for n, the sample size
57    confidence level = 90%
58    m, margin of error = 5%
59
60    From the "z alpha/2" table, for 90% confidence, we get a z alpha/2 value of 1.6449 (or 1.645).
61
62    Then the sample size, n, we need is = 1.6449^2 * 4360 / ( 1.6449^2 + (4 * 4359) * 0.05^2) = 255 (rounded up)
63
64
65    For N = 681,
66    sample size n is = 1.6449^2 * 681 / ( 1.6449^2 + (4 * 680) * 0.05^2) = 194 (rounded up)
67
68
69    sample size for NZ: 255 (90% confidence with 5% margin of error, including a finite population correction factor)
70    sample size for US: 194
71
72*/
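
The sample-size arithmetic in the notes above can be re-checked with a small script. This is only a sketch: the function name `sampleSize` and the constant `Z_90` are made up for illustration, and it assumes the formula in the notes is the usual finite-population sample size for a proportion with the worst-case proportion p = 0.5 (which is where the factor 4 in the denominator comes from).

```javascript
// Sample size for estimating a proportion in a finite population of size N,
// assuming the worst-case proportion p = 0.5, so p * (1 - p) = 1/4.
// This is the same expression as in the notes:
//   n = z^2 * N / (z^2 + 4 * (N - 1) * m^2)
function sampleSize(populationSize, z, marginOfError) {
    const z2 = z * z;
    const n = (z2 * populationSize) /
              (z2 + 4 * (populationSize - 1) * marginOfError * marginOfError);
    return Math.ceil(n); // round up, as the notes do
}

const Z_90 = 1.6449; // z alpha/2 for 90% confidence, value taken from the notes

console.log("NZ:", sampleSize(4360, Z_90, 0.05)); // N = 4360 isMRI pages
console.log("US:", sampleSize(681, Z_90, 0.05));  // N = 681 isMRI pages
```

With the values from the notes (z = 1.6449, m = 0.05) this reproduces the two sample sizes listed above, 255 for NZ and 194 for US.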
73
74
75
76"_id","siteCount containsMRI","numPagesInMRICount","numPagesContainingMRICount","URLs of pages detected as inMRI"
77"nz","176.0","4360","9641"
78"us","29.0","681","953"
79"au","2.0","8","86"
80"de","2.0","4","11"
81"dk","2.0","4","7"
82"bg","1.0","2","2"
83"cz","1.0","0","1"
84"es","1.0","1","1"
85"fr","1.0","1","1"
86"ie","1.0","1","3"
87
88Total sites containing MRI: 216
89[of which 96 isMRI sites from NZ]
90Total pages detected as being in MRI: 5062
91Total pages detected as containing MRI sentences: 10706
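
The three totals above can be cross-checked against the summary table by summing its columns. A minimal sketch, with the rows copied by hand from the table (the variable names are illustrative only):

```javascript
// Rows copied from the summary table:
// [country, siteCount, numPagesInMRICount, numPagesContainingMRICount]
const rows = [
    ["nz", 176, 4360, 9641],
    ["us", 29, 681, 953],
    ["au", 2, 8, 86],
    ["de", 2, 4, 11],
    ["dk", 2, 4, 7],
    ["bg", 1, 2, 2],
    ["cz", 1, 0, 1],
    ["es", 1, 1, 1],
    ["fr", 1, 1, 1],
    ["ie", 1, 1, 3],
];

// Sum each numeric column over all countries.
const totals = rows.reduce(
    (acc, [, sites, inMRI, containsMRI]) => ({
        sites: acc.sites + sites,
        inMRI: acc.inMRI + inMRI,
        containsMRI: acc.containsMRI + containsMRI,
    }),
    { sites: 0, inMRI: 0, containsMRI: 0 }
);

console.log(totals);
```

The sums come out as 216 sites, 5062 pages in MRI and 10706 pages containing MRI, matching the totals listed.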
92
93
94
95NZ - sample 255 pages from:
/*
// Variant 1: NZ sites (geolocated in NZ, or with .nz in the domain)
// that have at least one page containing MRI.
db.Websites.aggregate([
    {
        $match: {
            $and: [
                {numPagesContainingMRI: {$gt: 0}},
                {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]}
            ]
        }
    },
    { $unwind: "$geoLocationCountryCode" },
    {
        // Put every matching site into one "nz" bucket: count the sites,
        // collect the distinct domains and sum the per-site page counts.
        $group: {
            _id: "nz",
            count: { $sum: 1 },
            domain: { $addToSet: '$domain' },
            numPagesInMRICount: { $sum: '$numPagesInMRI' },
            numPagesContainingMRICount: { $sum: '$numPagesContainingMRI' }
        }
    },
    { $sort : { count : -1} }
]);


OR is this better:

// Variant 2: same as above, but only sites with at least one page
// detected as being in MRI (numPagesInMRI > 0).
db.Websites.aggregate([
    {
        $match: {
            $and: [
                {numPagesInMRI: {$gt: 0}},
                {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]}
            ]
        }
    },
    { $unwind: "$geoLocationCountryCode" },
    {
        $group: {
            _id: "nz",
            count: { $sum: 1 },
            domain: { $addToSet: '$domain' },
            numPagesInMRICount: { $sum: '$numPagesInMRI' },
            numPagesContainingMRICount: { $sum: '$numPagesContainingMRI' }
        }
    },
    { $sort : { count : -1} }
]);
*/
144
145num NZ sites with > 0 isMRI pages = 96
146Total numPagesInMRI in NZ sites = 4360
147Total numPagesContainingMRI in NZ sites = 7968
148
149The results give the list of domains that matched: 171 nz domains, though it should be 176? -1
150
151Copy each domain (up to 255 of them) into the queries below and look at no more than the first 1 or 2 pages that match isMRI:
152
1531. db.getCollection('Webpages').find({URL:/pukekohe\.directorybusiness\.co\.nz/, isMRI: true}) - check that it returns a positive number of pages in MRI, then check the first 1-2 pages to make sure they are indeed in MRI. Note down the ratio of confirmed MRI pages to pages checked, e.g. 2/2.
154
1552. Find the pages that have containsMRI set but not isMRI, and check whether they do contain sentences in MRI. Note down the ratio for the first 2 pages.
156db.getCollection('Webpages').find({URL:/maori\.livingheritage\.org\.nz/, isMRI: false, containsMRI: true})
157
158
159
160/* 1 */
161{
162    "_id" : "nz",
163    "count" : 96.0,
164    "domain" : [
165        "http://www.teipukarea.maori.nz", 3/3 1/3
166        "http://ngatipahauwera.co.nz", 2/2, 2/2
167        "http://www.oag.govt.nz", 2/2 0/2
168        "https://sexualviolence.victimsinfo.govt.nz", 3/3 0/3
169        "http://tmoa.tki.org.nz", 3/3 3/3
170        "http://www.tewhanake.maori.nz", 3/3 2/3
171        "http://www.matarikifestival.org.nz", 4/4 0/3
172        "http://www.otepoti.school.nz", 3/3 0/4
173!!        "https://www.maoritelevision.com", 3/4, 0 [no containsMRI outside isMRI pages]
174        "http://pukapuka.nz", 3/3 1/4 [lorem ipsum used on first 3 pages]
175        "http://community.nzdl.org", 3/3 0/3 [containsMRI has detected Te Taka Keegan as MRI sentence]
176!!        "http://kmpmusic.co.nz", 0-4/4? [but CD listing of some MRI song titles] 0 [no other pages containsMRI]
177        "http://maori.livingheritage.org.nz", 2/2 2/2
178        "http://pukoro.co.nz", 2/2 0/2
179        "https://register.tpota.org.nz", 0/1 [form] 0/2
180X        "https://cdn.tehiku.nz" => DOMAIN: "tehiku.nz",  0/4, 1/3 [but audio content may be in MRI]
181!!        "http://www.runanga.co.nz", 3/3 0 [no containsMRI outside isMRI pages]
182!        "http://kuraaiwi.maori.nz", 2/4 [navigation only downloaded. But site content checked] 2/3
183        "http://kurataiao.tki.org.nz", 3/3, 1/total 3
184
185!!        "http://satellites.co.nz", 3/3 [kpop], 0 [no containsMRI outside isMRI pages]
186        "http://teaohou.natlib.govt.nz", 4/4, 2/4
187        "http://www.tuwharetoa.iwi.nz", 2/3 0/3
188X        "http://auturoa.nz", 0/4 0/3 [lots of MRI terms among English] - COMMUNITY
189        "https://www.terito.school.nz", 3/3, 0/2 total
190        "https://ttw1.cwp.govt.nz", 3/3 3/3
191        "https://www.whanau-tahi.school.nz", 4/4, 1/2 total
192        "https://e-ako-pangarau.nzmaths.co.nz", 3/3 total, 1/1 total
193        "https://teaomaori.news", 3/3, 0/1 total
194        "http://tetaurawhiri.govt.nz", 3/3 3/3 [Māori Language Commission site]
195        "https://www.tuiatematangi.ac.nz", 4/4 3/3
196        "http://animations.tewhanake.maori.nz", 3/3 3/3
197!!       "https://www.dnc.org.nz", 1/1 total, 0 [no containsMRI outside isMRI pages]
198!!        "http://firstworldwar.tki.org.nz", 3/3, 0 [no containsMRI outside isMRI pages]
199        "http://www.28maoribattalion.org.nz", 3/3, 1/3
200        "http://www.tewikiotereomaori.co.nz", 1/1 total, 3/3
201        "http://www.brettgraham.co.nz", 1/1 total, 0/3
202!!        "https://hepatakakupu.nz", 3/3, 0 [no containsMRI outside isMRI pages]
203
204        "http://anglicanprayerbook.nz", 3/3 3/3
205        "http://arataua.nz", 4/4, 2/3
206        "http://blog.teara.govt.nz", 3/3, 0/3 [AS: teara.govt.nz]
207        "http://maori.tki.org.nz", 3/3 3/3
208DONE (with/out www):        "http://www.firstworldwar.tki.org.nz",
209X        "http://www.topomap.co.nz", 0/2 [all placenames], 0 [no containsMRI outside isMRI pages]
210        "https://paekupu.co.nz", 4/4, 0 [no containsMRI outside isMRI pages]
211        "https://haereheikaiako.co.nz", 1/1, 0 [no containsMRI outside isMRI pages]
212        "https://curriculumtool.education.govt.nz", 4/4, 3/3
213        "http://kurakokiri.maori.nz", 3/3, 3/3 [same nav menus on each page]
214        "http://kete.wcl.govt.nz", 2/5 [first 3 misdetected: Tokelauan (American Samoa), Kiribati, Tongan], 0/3
215        "http://www.kkmmaungarongo.co.nz", 3/3, 3/3
216        "http://www.heartland.co.nz", 3/3, 1/1 total
217        "http://oilcrash.com", 2/2 total, 0/3
218        "http://www.kura-porirua.school.nz", 4/4, 2/3
219        "http://videos.e-agent.nz", [AT: e-agent.nz] 3/3, 3/3 [repeated nav]
220        "https://www.sporty.co.nz", 3/3, 0 [no containsMRI outside isMRI pages]
221        "https://www.tematawai.maori.nz", 3/3, 3/3
222
223        "https://www.terakipaewhenua.school.nz",
224        "http://www.tetaurawhiri.govt.nz",
225        "http://archive.stats.govt.nz",
226        "http://tiritiowaitangi.govt.nz",
227        "http://www.waiata.maori.nz",
228        "http://hana.co.nz",
229        "http://kaupare.co.nz",
230        "http://www.tereowrap.nz",
231        "https://www.e-agent.nz",
232        "http://www.hrc.co.nz",
233        "http://ngatiporoukiponeke.org.nz",
234        "http://rurued.school.nz",
235        "http://www.twtop.school.nz",
236        "https://www.infinite-electronic.nz",
237        "http://www.huri-translations.pf",
238        "https://admin.teara.govt.nz",
239        "https://tiritiowaitangi.govt.nz",
240        "http://www.tmoa.tki.org.nz",
241        "https://www.komako.org.nz",
242        "http://www.wcl.govt.nz",
243        "https://office.e-agent.nz",
244        "http://punareo.co.nz",
245        "http://www.kurakokiri.maori.nz",
246        "https://rapuatearatika.education.govt.nz",
247        "http://tmmkkm.school.nz",
248        "https://www.components-mart.nz",
249        "http://www.cs.waikato.ac.nz",
250        "http://www.kupengahao.co.nz",
251        "https://www.hapuhauora.health.nz",
252        "https://www.lcds-display.nz",
253        "http://waiata.maori.nz",
254        "http://cms.sunsmartschools.co.nz",
255        "http://www.livingheritage.org.nz",
256        "http://kuraproductions.co.nz",
257        "https://keepourmoneyclean.govt.nz",
258        "http://www.tekura.school.nz",
259        "http://www.tkkmmokopuna.school.nz",
260        "http://hangaraumatihiko.tki.org.nz",
261        "http://www.pakanae.maori.nz"
262    ],
263    "numPagesInMRICount" : 4360,
264    "numPagesContainingMRICount" : 7968
265}
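
The per-domain ratios noted next to the domains above (first ratio: isMRI pages checked, second ratio: containsMRI-only pages checked) could be tallied along these lines. This is a rough sketch, not the method actually used for these notes: `parseRatios` and `tally` are hypothetical helper names, only plain "a/b" forms are handled, and annotated entries such as "0-4/4?" or a bare "0" would still need to be resolved by hand.

```javascript
// Extract up to two "a/b" ratios from a note such as "3/3 1/3" or "2/2, 2/2".
function parseRatios(note) {
    return [...note.matchAll(/(\d+)\/(\d+)/g)]
        .slice(0, 2)
        .map(m => ({ ok: Number(m[1]), checked: Number(m[2]) }));
}

// Add up the first ratio (isMRI) and the second ratio (containsMRI) over all notes.
function tally(notes) {
    const totals = [{ ok: 0, checked: 0 }, { ok: 0, checked: 0 }];
    for (const note of notes) {
        parseRatios(note).forEach((r, i) => {
            totals[i].ok += r.ok;
            totals[i].checked += r.checked;
        });
    }
    return { isMRI: totals[0], containsMRI: totals[1] };
}

// Notes copied from three entries in the list above:
// teipukarea.maori.nz, ngatipahauwera.co.nz and oag.govt.nz.
const notes = ["3/3 1/3", "2/2, 2/2", "2/2 0/2"];
console.log(tally(notes));
```

For these three entries the tally is 7 of 7 checked isMRI pages confirmed, and 3 of 7 checked containsMRI pages confirmed.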
266
267----------------------------
268
269/* 1 */
270{
271    "_id" : "nz",
272    "count" : 176.0,
273    "domain" : [
274!!        "http://pukekohe.directorybusiness.co.nz", 0/2, 0/2, isMRI = 0!!
275        "http://maori.livingheritage.org.nz", 2/2 2/2
276        "http://pukoro.co.nz", 2/2 0/2
277        "http://www.rakaumanga.school.nz", 0/4 0/4
278        "http://www.ngamanawainc.co.nz", 0/2 0/2
279        "https://office.e-agent.nz",
280        "https://www.components-mart.nz",
281        "http://tmmkkm.school.nz",
282        "http://www.rotoruanz.com",
283        "http://www.huri-translations.pf",
284        "https://admin.teara.govt.nz",
285        "http://hangaraumatihiko.tki.org.nz",
286        "https://sexualviolence.victimsinfo.govt.nz",
287        "http://www.tekura.school.nz",
288        "http://philipbeadle.co.nz",
289        "http://www.cs.waikato.ac.nz",
290        "https://www.hapuhauora.health.nz",
291        "http://cms.sunsmartschools.co.nz",
292        "https://keepourmoneyclean.govt.nz",
293        "http://www.kura-porirua.school.nz",
294        "http://waitarahistory.org.nz",
295        "http://oilcrash.com",
296        "http://videos.e-agent.nz",
297        "https://manawatuheritage.pncc.govt.nz",
298        "https://www.terakipaewhenua.school.nz",
299        "http://dev.nzpcn.org.nz",
300        "https://kotahimiriona.co.nz",
301        "http://kurakokiri.maori.nz",
302        "https://www.sporty.co.nz",
303        "http://kaupare.co.nz",
304        "http://ngatiporoukiponeke.org.nz",
305        "https://www.takitimu.ac.nz",
306        "http://www.tetaurawhiri.govt.nz",
307        "http://www.waiata.maori.nz",
308        "http://conference.tpwt.maori.nz",
309        "http://ngatiwhakaue.iwi.nz",
310        "http://www.nzpcn.org.nz",
311        "http://www.ruralfind.co.nz",
312        "https://www.dnc.org.nz",
313        "https://www.puau.school.nz",
314        "https://kaiiwicamp.nz",
315        "https://www.terito.school.nz",
316        "https://www.pinterest.nz",
317        "https://e-ako-pangarau.nzmaths.co.nz",
318        "http://givealittle.co.nz",
319        "https://teaomaori.news",
320        "https://www.korokikahukura.co.nz",
321        "http://myfathersworld.net.nz",
322        "http://www.firstworldwar.tki.org.nz",
323        "https://www.ashtangatauranga.co.nz",
324        "http://biketorqueyamaha.co.nz",
325        "https://www.rereahu.maori.nz",
326        "http://www.tewikiotereomaori.co.nz",
327        "http://www.brettgraham.co.nz",
328        "http://tewikiotereomaori.nz",
329        "http://anglicanprayerbook.nz",
330        "http://arataua.nz",
331        "http://blog.teara.govt.nz",
332        "http://www.otepoti.school.nz",
333        "http://www.kmk.maori.nz",
334        "http://www.eventcinemas.co.nz",
335        "https://www.stats.govt.nz",
336        "http://www.oag.govt.nz", 2/2 0/2
337        "http://whatonga.school.nz",
338        "http://www.tewhanake.maori.nz",
339        "https://www.maoritelevision.com",
340        "http://kuraaiwi.maori.nz",
341        "http://kurataiao.tki.org.nz",
342        "http://teaohou.natlib.govt.nz",
343        "http://www.tetaumuturunanga.iwi.nz",
344        "http://www.tasteofplenty.co.nz",
345        "http://community.nzdl.org",
346        "https://www.blushandbrows.nz",
347        "https://register.tpota.org.nz",
348        "https://cdn.tehiku.nz",
349        "http://www.wcl.govt.nz",
350        "http://www.jeremybaker.nz",
351        "http://punareo.co.nz",
352        "https://rapuatearatika.education.govt.nz",
353        "http://www.kurakokiri.maori.nz",
354        "https://www.cruisetourstauranga.co.nz",
355        "https://sooty.nz",
356        "http://rakaumanga.school.nz",
357        "https://tiritiowaitangi.govt.nz",
358        "http://www.tmoa.tki.org.nz",
359        "http://www.w3vietnam.org.nz",
360        "https://www.infinite-electronic.nz",
361        "https://www.komako.org.nz",
362        "http://nzpostcard.co.nz",
363        "http://artizani.co.nz",
364        "http://www.finlaysonpark.school.nz",
365        "http://crimson.co.nz",
366        "http://holyspirit.nz",
367        "http://www.tkkmmokopuna.school.nz",
368        "http://www.pakanae.maori.nz",
369        "http://www.teipukarea.maori.nz",
370        "http://archerpix.com",
371        "https://2019.nethui.nz",
372        "http://www.kupengahao.co.nz",
373        "https://www.lcds-display.nz",
374        "http://waiata.maori.nz",
375        "http://kuraproductions.co.nz",
376        "http://www.biketorqueyamaha.co.nz",
377        "http://www.livingheritage.org.nz",
378        "http://www.zoomin.co.nz",
379        "http://rsnz.natlib.govt.nz",
380        "http://otorohanga.directorybusiness.co.nz",
381        "http://reoora.co.nz",
382        "http://w3vietnam.org.nz",
383        "https://rehuamarae.co.nz",
384        "https://www.electionresults.org.nz",
385        "https://www.ngamanawainc.co.nz",
386        "https://www.rotorua-rafting.co.nz",
387        "https://www.taitokerautrust.org.nz",
388        "https://www.wingspan.co.nz",
389        "http://www.kkmmaungarongo.co.nz",
390        "http://kete.wcl.govt.nz",
391        "http://www.heartland.co.nz",
392        "http://www.electionresults.govt.nz",
393        "https://www.tematawai.maori.nz",
394        "http://hana.co.nz",
395        "http://www.tereowrap.nz",
396        "http://rurued.school.nz",
397        "http://www.twtop.school.nz",
398        "http://rexedra.gen.nz",
399        "http://archive.stats.govt.nz",
400        "https://liveresults.co.nz",
401        "https://www.e-agent.nz",
402        "http://tiritiowaitangi.govt.nz",
403        "http://www.hrc.co.nz",
404        "http://animations.tewhanake.maori.nz",
405        "https://interactives.stuff.co.nz",
406        "http://avonside.net",
407        "http://www.methodist.org.nz",
408        "https://www.tasteofplenty.co.nz",
409        "http://www.maoriinvestments.co.nz",
410        "https://m.wairarapatv.co.nz",
411        "http://www.gans.co.nz",
412        "https://ttw1.cwp.govt.nz",
413        "http://ngarauhuia.ngatiapakiterato.iwi.nz",
414        "https://www.tuiatematangi.ac.nz",
415        "http://tetaurawhiri.govt.nz",
416        "http://maori.tki.org.nz",
417        "http://www.topomap.co.nz",
418        "https://www.puhaandpakeha.co.nz",
419        "https://haereheikaiako.co.nz",
420        "https://paekupu.co.nz",
421        "https://curriculumtool.education.govt.nz",
422        "http://firstworldwar.tki.org.nz",
423        "http://www.28maoribattalion.org.nz",
424        "https://hepatakakupu.nz",
425        "https://www.zenbu.co.nz",
426        "http://www.matarikifestival.org.nz",
427        "http://pukapuka.nz",
428        "http://ngatipahauwera.co.nz", 2/2 2/2
429        "http://southerntribes.co.nz",
430        "https://player.vimeo.com",
431        "http://tmoa.tki.org.nz",
432        "http://www.writersfestival.co.nz",
433        "http://talkingtothecan.com",
434        "https://www.whanau-tahi.school.nz",
435        "http://satellites.co.nz",
436        "http://auturoa.nz",
437        "http://www.tuwharetoa.iwi.nz",
438        "http://kmpmusic.co.nz",
439        "http://www.temarareo.org",
440        "http://archive.electionresults.govt.nz",
441        "http://kaiiwicamp.nz",
442        "http://tehauora.org.nz",
443        "http://temahurehure.maori.nz",
444        "http://www.runanga.co.nz"
445    ],
446    "numPagesInMRICount" : 4360,
447    "numPagesContainingMRICount" : 9641
448}
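
Several entries in the domain lists above differ only in protocol or in a leading "www." (for example the firstworldwar.tki.org.nz pair marked "DONE (with/out www)"). One possible way to spot such variants is sketched below; `siteKey` and `findVariants` are hypothetical helpers, and treating these variants as the same site is an assumption rather than something the crawl data guarantees.

```javascript
// Reduce a domain entry to a comparison key: drop the protocol and a leading "www.".
function siteKey(url) {
    return url.replace(/^https?:\/\//, "").replace(/^www\./, "");
}

// Group entries by key and keep only the keys that occur more than once.
function findVariants(domains) {
    const groups = new Map();
    for (const d of domains) {
        const key = siteKey(d);
        if (!groups.has(key)) groups.set(key, []);
        groups.get(key).push(d);
    }
    return [...groups].filter(([, urls]) => urls.length > 1);
}

// A few entries copied from the list above.
const sample = [
    "http://www.firstworldwar.tki.org.nz",
    "http://firstworldwar.tki.org.nz",
    "http://tiritiowaitangi.govt.nz",
    "https://tiritiowaitangi.govt.nz",
    "http://pukapuka.nz",
];

for (const [key, urls] of findVariants(sample)) {
    console.log(key, "<=", urls.join(", "));
}
```

On this sample it reports two groups, firstworldwar.tki.org.nz and tiritiowaitangi.govt.nz, each with two variants.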
449
450