source: other-projects/maori-lang-detection/mongodb-data/ManualShortlisting2.txt@ 33907

Last change on this file since 33907 was 33907, checked in by ak19, 4 years ago

See previous commit message. This will be the file with the results for the data reingested into MongoDB

File size: 75.5 KB
Want to MANUALLY go over all sites that are detected as containing one or more pages with at least one MRI sentence
and shortlist those sites genuinely containing at least one MRI sentence.


Total num sites detected as containing MRI:
    db.getCollection('Websites').find({numPagesContainingMRI: {$gt: 0}}).count()
= 869


To make the manual task easier,
splitting the results of all sites with numPagesContainingMRI > 0 into NZ sites and overseas sites,
since NZ sites are more likely to contain MRI content.

-----------------------------------------------------------
A. OVERSEAS SITES: sites neither NZ in origin NOR with a .nz TLD
-----------------------------------------------------------
Further splitting the overseas sites into a set with an mi in the URL path (mi.* or */mi) and those without,
since overseas sites with mi in the URL path are more likely to be automatically translated product sites.

1. db.getCollection('Websites').find(
{$and: [
    {numPagesContainingMRI: {$gt: 0}},
    {geoLocationCountryCode: {$ne: "NZ"}},
    {domain: {$not: /\.nz$/}},
    {urlContainsLangCodeInPath: {$ne: true}}
]}).count()

= 221 websites

[Treating Australia as a special case since one of the 4 Australian sites with numPagesContainingMRI > 0
had an mi in the URL path but was not automatically translated]

# counts by country code excluding NZ related sites

db.getCollection('Websites').find({$and: [
    {geoLocationCountryCode: {$ne: "NZ"}},
    {domain: {$not: /\.nz/}},
    {numPagesContainingMRI: {$gt: 0}},
    {$or: [{geoLocationCountryCode: "AU"}, {urlContainsLangCodeInPath: false}]}
]}).count()

= 221 websites
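
The `domain` conditions in the queries above depend on how the `.nz` suffix is written as a regex. A minimal sketch in plain JavaScript (runnable outside the mongo shell) comparing an unescaped dot, an escaped and anchored pattern, and an escaped but unanchored one. The sample domains are chosen for illustration only and are not results from the Websites collection; `matchTable` is a hypothetical helper name.

```javascript
// Sample domains for illustration only (assumed, not query results).
const samples = [
  "http://www.maori.org.nz",          // genuine .nz domain
  "http://example.benz",              // ends in "nz" but is not a .nz domain
  "http://www.nzphotos.example.com"   // contains ".nz" but does not end in it
];

const patterns = {
  unescaped: /.nz$/,    // "." matches any character before "nz"
  anchored: /\.nz$/,    // literal ".nz" at the end of the string
  unanchored: /\.nz/    // literal ".nz" anywhere in the string
};

// For each pattern, list the sample domains it matches.
function matchTable(domains, pats) {
  const table = {};
  for (const [name, re] of Object.entries(pats)) {
    table[name] = domains.filter(d => re.test(d));
  }
  return table;
}

const table = matchTable(samples, patterns);
console.log(table);
```

Under `$not`, a site is dropped from the overseas set whenever the pattern matches, so the unanchored form would also drop a site such as the third sample, and the unescaped form would drop the second.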



Getting a domain listing of the sites that matched, per country:
db.Websites.aggregate([
    {
        $match: {
            $and: [
                {geoLocationCountryCode: {$ne: "NZ"}},
                {domain: {$not: /\.nz/}},
                {numPagesContainingMRI: {$gt: 0}},
                {$or: [{geoLocationCountryCode: "AU"}, {urlContainsLangCodeInPath: false}]}
            ]
        }
    },
    { $unwind: "$geoLocationCountryCode" },
    {
        $group: {
            _id: {$toLower: '$geoLocationCountryCode'},
            count: { $sum: 1 },
            domain: { $addToSet: '$domain' }
        }
    },
    { $sort : { count : -1} }
]);
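
To make the shape of the grouped output easier to read, here is a rough plain-JavaScript equivalent of the `$group` and `$sort` stages, run over a small in-memory array. The records are invented for illustration; only the field names `geoLocationCountryCode` and `domain` come from the aggregation above, and `groupByCountry` is a hypothetical name. It is a sketch of what the stages compute, not a replacement for the aggregation.

```javascript
// Invented records; only the field names follow the Websites collection.
const sites = [
  { domain: "http://a.example.com", geoLocationCountryCode: "US" },
  { domain: "http://b.example.com", geoLocationCountryCode: "US" },
  { domain: "http://a.example.com", geoLocationCountryCode: "US" }, // repeated domain value
  { domain: "http://c.example.de",  geoLocationCountryCode: "DE" }
];

function groupByCountry(records) {
  const groups = new Map();
  for (const r of records) {
    const key = r.geoLocationCountryCode.toLowerCase(); // like $toLower
    if (!groups.has(key)) {
      groups.set(key, { _id: key, count: 0, domain: new Set() });
    }
    const g = groups.get(key);
    g.count += 1;           // like {$sum: 1}: one per document
    g.domain.add(r.domain); // like $addToSet: distinct values only
  }
  return [...groups.values()]
    .map(g => ({ _id: g._id, count: g.count, domain: [...g.domain] }))
    .sort((x, y) => y.count - x.count); // like {$sort: {count: -1}}
}

const grouped = groupByCountry(sites);
console.log(JSON.stringify(grouped, null, 2));
```

Note that `count` tallies matching documents while `domain` keeps only distinct values, so the two need not agree when several documents share a `domain` value.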


/* 1 */
{
    "_id" : "us",
    "count" : 117.0,
    "domain" : [
        "http://daandehn.com",
        "https://articles.imperialtometric.com",
        "http://anglican.org",
        "http://mikebonnice.com",
        "https://nl.pinterest.com",
        "http://svenskadress.net",
        "http://word-dialect.blogspot.com",
        "http://fhr.kiwicelts.com",
        "http://www.huapala.org",
        "http://www.whoisthatr.com",
        "http://www.precious-testimonies.com",
        "https://www.oemsec.com",
        "http://www.godrules.net",
        "https://www.pinterest.it",
        "http://www.wikitree.com",
        "http://ritusehji.blogspot.com",
        "http://www.frogsonline.com",
        "https://biblehub.com",
        "https://www.pinterest.co.uk",
        "http://pumanawawhangara.blogspot.com",
        "http://hannas-reiseblog.blogspot.com",
        "http://frontrowphotos.com",
        "https://www.pinterest.ca",
        "http://www.muhammad.com",
        "https://www.pinterest.jp",
        "http://www.gotquestions.org",
        "https://www.dbnames.net",
        "http://www.hudl.com",
        "https://ebible.org",
        "http://tuhua2010.blogspot.com",
        "http://ww25.milfsplease.com",
        "http://www.thesalmons.org",
        "https://wol.jw.org",
        "http://georgegi.tripod.com",
        "http://linkvip.top",
        "https://docs.google.com",
        "http://rangiwewehi.com",
        "http://anglicanhistory.org",
        "http://niken8media.logdown.com",
        "http://mrshamiltonskoolkidz.blogspot.com",
        "https://www.vaihaunui.net",
        "http://dannykahei.tripod.com",
        "http://www.lunar-occultations.com",
        "http://seapixonline.com",
        "http://tkrow.tripod.com",
        "https://drive.google.com",
        "http://takethatvacation.com",
        "https://in.pinterest.com",
        "https://www.nccri.ie",
        "https://www.webwiki.com",
        "http://www.unicode.org",
        "http://shangrilapress.net",
        "http://ngarangatahi.tripod.com",
        "https://static-promote.weebly.com",
        "https://www.podrozeady.com",
        "https://www.blue-frontiers.com",
        "https://www.indexmundi.com",
        "http://www.namesdir.com",
        "https://www.bible.com",
        "http://www.krassotkin.ru",
        "http://malecek.com",
        "http://korora.econ.yale.edu",
        "https://www.poehalisnami.ua",
        "http://loquevendra318.com",
        "https://www.terakau.org",
        "https://za.pinterest.com",
        "http://www.mkiwi.com",
        "http://maaori.com",
        "http://atopeconlostopes.blogspot.com",
        "http://worldradiomap.com",
        "http://eartheum.com",
        "http://www.forensicfashion.com",
        "http://www.code-postal.com",
        "http://www.pressreader.com",
        "https://www.seapixonline.com",
        "http://lianzaconference2012.blogspot.com",
        "http://blogdepasopor.blogspot.com",
        "https://www.code-postal.com",
        "http://www.steve-wheeler.co.uk",
        "https://www.knowatom.com",
        "http://bahaiprayers.net",
        "http://www.eyecontactsite.com",
        "http://www.hiroa.pf",
        "http://mahoraroom8.blogspot.com",
        "http://www.roadsmile.com",
        "https://chromium.googlesource.com",
        "http://aclhokiangarocks.blogspot.com",
        "http://wowwars.net",
        "https://www.hidroponia.org.mx",
        "http://tkkpipipaopao.blogspot.com",
        "http://tatai09.blogspot.com",
        "http://kiaorahola.blogspot.com",
        "http://manateina.blogspot.com",
        "http://www.the-naked.com",
        "http://shuttersportnelson.photoshelter.com",
        "http://precious-testimonies.com",
        "https://www.breaker.audio",
        "https://www.natekore2018.com",
        "http://naturalfatburner.net",
        "https://www.pinterest.fr",
        "https://www.pipirikiapapatuanuku.org",
        "http://capsuraotearoa.blogspot.com",
        "http://m.biblepub.com",
        "https://phet.colorado.edu",
        "https://livestream.com",
        "http://www.geni.com",
        "https://kjohnsonnz.blogspot.com",
        "https://maorinews.com",
        "http://www.twttoa.com",
        "http://www.whoisentry.com",
        "http://burkekm001.tripod.com",
        "http://wikiedit.org",
        "http://piripi.blogspot.com",
        "https://www.kaifineart.com",
        "https://png.bible",
        "http://rhymebrain.com",
        "http://www.v3whois.com",
        "http://www.waimate.com",
        "https://www.myadsclassified.com"
    ]
}

/* 2 */
{
    "_id" : "de",
    "count" : 19.0,
    "domain" : [
        "http://www.udhr.de",
        "http://m.distanta.1km.net",
        "http://arts.mythologica.fr",
        "http://vulkane.ch",
        "http://www.behlig.de",
        "http://www.nierstrasz.org",
        "https://www.tvteile.de",
        "http://etymologie.info",
        "https://www.cartogiraffe.com",
        "https://www.you-fly.com",
        "http://klaaskoehne.de",
        "http://weltderberge.de",
        "http://www.cartogiraffe.com",
        "http://svenkirsten.com",
        "https://laskar02cinta.page.tl",
        "http://etoile-de-lune.net",
        "https://ersatzteile-fachversand.de",
        "http://insecta.pro",
        "http://www.stephe.de"
    ]
}

/* 3 */
{
    "_id" : "fr",
    "count" : 16.0,
    "domain" : [
        "http://baladeornithologique.com",
        "http://chantsdeluttes.free.fr",
        "http://kihikihi.fr",
        "http://www.blueheavenisland.com",
        "http://splaf.free.fr",
        "https://www.lexilogos.com",
        "https://www.manualscat.com",
        "http://rapanui.fr",
        "http://www.gototahiti.net",
        "http://pt.city-usa.net",
        "http://blueheavenisland.com",
        "http://www.gif.ovh",
        "http://www.gaudry.be",
        "http://www.maraamusurfskirace.com",
        "http://mahajana.net",
        "http://www.rongo-rongo.com"
    ]
}

/* 4 */
{
    "_id" : "nl",
    "count" : 16.0,
    "domain" : [
        "http://tetsubo.org",
        "https://arrowhead.eu",
        "https://www.arrowhead.eu",
        "https://www.henrifloor.nl",
        "http://hidsonphoto.com",
        "https://arrowheadproject.azurewebsites.net",
        "http://nielsonboutique.co.uk",
        "http://www.gouvernante.info",
        "http://tonhut.nl",
        "http://diverosa.com",
        "http://longhornlaw.net",
        "http://www.nonlinear.demon.nl",
        "http://wearehomework.com",
        "http://gouvernante.info",
        "http://skimap.info",
        "http://www.encyclo.co.uk"
    ]
}

/* 5 */
{
    "_id" : "dk",
    "count" : 8.0,
    "domain" : [
        "http://powhiri.ngapuhitelevision.com",
        "http://akona.ngapuhitelevision.com",
        "http://ngapuhitelevision.com",
        "http://ngapuhiradio.com",
        "http://komisch.ngapuhitelevision.com",
        "http://www.rennertweb.de",
        "http://jazz.ngapuhitelevision.com",
        "http://waiatarangatiratanga.ngapuhitelevision.com"
    ]
}

/* 6 */
{
    "_id" : "ca",
    "count" : 7.0,
    "domain" : [
        "http://00.gs",
        "http://aguadilla.airport-authority.com",
        "http://bckayak.com",
        "http://bcmarina.com",
        "http://www.myrasplace.net"
    ]
}

/* 7 */
{
    "_id" : "au",
    "count" : 5.0,
    "domain" : [
        "https://infogram.com",
        "https://www.kiwiproperty.com",
        "http://theunderwaterworld.com",
        "http://fionajack.net",
        "https://koreromaori.com"
    ]
}

/* 8 */
{
    "_id" : "cz",
    "count" : 4.0,
    "domain" : [
        "http://www.henryklahola.nazory.cz",
        "http://about.ilikeyou.com",
        "http://henryklahola.nazory.cz",
        "https://www.fipojobs.com"
    ]
}

/* 9 */
{
    "_id" : "gb",
    "count" : 4.0,
    "domain" : [
        "http://www.wordsearchfun.com",
        "http://www.woolrych.org",
        "https://omniatlas.com",
        "http://mikestephens.co.uk"
    ]
}

/* 10 */
{
    "_id" : "es",
    "count" : 4.0,
    "domain" : [
        "http://www.info-hoteles.com",
        "https://www.uv.es",
        "https://www.reclamaciondevuelos.com",
        "http://www.cruceros-princess.mx"
    ]
}

/* 11 */
{
    "_id" : "at",
    "count" : 3.0,
    "domain" : [
        "http://www.petit-prince.at",
        "http://www.tmtmm.net",
        "http://petit-prince.at"
    ]
}

/* 12 */
{
    "_id" : "it",
    "count" : 3.0,
    "domain" : [
        "http://oipaz.net",
        "http://www.pegasoesmicamion.com",
        "http://www.marcosanti.it"
    ]
}

/* 13 */
{
    "_id" : "il",
    "count" : 2.0,
    "domain" : [
        "http://www.daat.ac.il",
        "https://www.hitiaotera.com"
    ]
}

/* 14 */
{
    "_id" : "ch",
    "count" : 2.0,
    "domain" : [
        "https://nicoledidi.ch",
        "https://photos.axelebert.org"
    ]
}

/* 15 */
{
    "_id" : "ro",
    "count" : 2.0,
    "domain" : [
        "http://www.parohiauceadesus.ro",
        "http://parohiauceadesus.ro"
    ]
}

/* 16 */
{
    "_id" : "mx",
    "count" : 1.0,
    "domain" : [
        "http://www.gelbukh.com"
    ]
}

/* 17 */
{
    "_id" : "unknown",
    "count" : 1.0,
    "domain" : [
        "https://www.viveipcl.com"
    ]
}

/* 18 */
{
    "_id" : "bg",
    "count" : 1.0,
    "domain" : [
        "http://anitra.net"
    ]
}

/* 19 */
{
    "_id" : "cn",
    "count" : 1.0,
    "domain" : [
        "http://kiwi2china.com"
    ]
}

/* 20 */
{
    "_id" : "ir",
    "count" : 1.0,
    "domain" : [
        "https://www.dideo.ir"
    ]
}

/* 21 */
{
    "_id" : "fi",
    "count" : 1.0,
    "domain" : [
        "http://pertti.com"
    ]
}

/* 22 */
{
    "_id" : "ie",
    "count" : 1.0,
    "domain" : [
        "https://coggle.it"
    ]
}

/* 23 */
{
    "_id" : "ru",
    "count" : 1.0,
    "domain" : [
        "https://www.gismeteo.lv"
    ]
}

/* 24 */
{
    "_id" : "jp",
    "count" : 1.0,
    "domain" : [
        "http://yutaka.it-n.jp"
    ]
}



Can inspect a website's pages to see whether its content is relevant or auto-translated, as follows:
    db.getCollection('Webpages').find({URL: /svenkirsten\.com/, mriSentenceCount: {$gt: 0}})
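
The same check can be repeated for each domain on the shortlist. Because the domain is dropped into a regex, its dots are best escaped so that they only match literal dots. A small helper, sketched in plain JavaScript: `escapeRegex` and `pageFilterFor` are hypothetical names, the field names `URL` and `mriSentenceCount` are those used in the query above, and the returned object is intended to be passed to `db.getCollection('Webpages').find(...)` in the mongo shell.

```javascript
// Escape regex metacharacters so the text is matched literally.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build the filter document for pages of one site that contain MRI sentences.
function pageFilterFor(domain) {
  return {
    URL: new RegExp(escapeRegex(domain)),
    mriSentenceCount: { $gt: 0 }
  };
}

const filter = pageFilterFor("svenkirsten.com");
console.log(filter.URL.source);
```

In the shell this would be used as `db.getCollection('Webpages').find(pageFilterFor("svenkirsten.com"))`.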


* CN: Only 1/113 sites from CN stood out as being of interest: http://kiwi2china.com/
   BUT: it's auto-translated (e.g. Dutch is clearly auto-translated), MRI not in default or any visible drop-down list, and the domain changes once you view it in Dutch to https://nl.admission.nz/

* FR: 16 sites from FR
   http://blueheavenisland.com, http://www.blueheavenisland.com - misdetection. French Polynesia
   https://www.lexilogos.com/ -> takes me to NZ website MaoriDictionary.co.nz etc for translating words anyway
   http://kihikihi.fr/ -> travel (blog?). Appears to be Hawaiian related and not Maori.
!! http://chantsdeluttes.free.fr/versionsinter/page%20maori.html -> Seems it may be a proper translation or composition, as Dutch and Flemish (and Groningense) versions are different songs by individual translators/composers
   http://splaf.free.fr/pfurb.html - Tahiti, French Polynesian, ... island names
X  http://mi.fitnessrebates.com - Uses https://wordpress.org/plugins/weglot/, a WordPress-compatible multilingual plugin, which ensures translated pages get indexed by Google - exactly what we want to avoid
   http://mahajana.net - misdetected a Japanese Zen Buddhist chant as MRI
   http://rapanui.fr - Rapa Nui Easter Island. Misdetected.
   http://www.gif.ovh - autotranslated pages. Supposedly a GIF repository
   http://baladeornithologique.com - misdetection of the word "Retour"
   http://www.gaudry.be - misdetection of Japanese hiragana etc, and French "faire", as MRI
   http://www.gototahiti.net - probably misdetection, see title
   http://www.maraamusurfskirace.com - Bora Bora, French Polynesia. Misdetected.
   http://www.rongo-rongo.com - appears to be related to Easter Island. Just 1 sentence however.
   http://pt.city-usa.net - misdetection. Hawaii.
   https://www.manualscat.com - Misdetection. Appears to be in German. Manuals pages.
NL:
(!!!) - http://www.gouvernante.info and http://gouvernante.info - radio links to NZ websites not found by Common Crawl and which potentially have Maori language content. For example, http://irirangi.net/, https://www.atiawatoafm.com, www.maori.org.nz [http://www.gouvernante.info/radio4.htm]
- https://www.arrowhead.eu, https://arrowheadproject.azurewebsites.net, arrowhead.eu - misidentification of URL
- tonhut.nl - misidentification
? http://nielsonboutique.co.uk, http://longhornlaw.net, http://tetsubo.org, http://hidsonphoto.com, http://wearehomework.com/ - Feels autotranslated, but no language options visible. All SEO related
- diverosa.com - Rapa Nui, Easter Island
- nonlinear.demon.nl - misidentified
- encyclo.co.uk - misidentification
- henrifloor.nl - misidentification
- http://skimap.info/ - maps, NZ placenames in PDF
DK:
!! ++ http://akona.ngapuhitelevision.com, http://waiatarangatiratanga.ngapuhitelevision.com,
      http://jazz.ngapuhitelevision.com, http://ngapuhitelevision.com, http://ngapuhiradio.com,
      http://powhiri.ngapuhitelevision.com, http://komisch.ngapuhitelevision.com
- http://www.rennertweb.de - a photogallery page mentioning NZ placenames
CA:
- http://bcmarina.com AND http://bckayak.com - photos with Canadian placenames
- http://www.myrasplace.net - pages of photos, captions involving NZ placenames
~ http://00.gs/Maniapoto;Uriwera;Moriori;Hivaoa;Kumulipo.htm - Maori-Polynesian comparative dictionary words listing
- aguadilla.airport-authority.com - misidentification
- https://articles.imperialtometric.com - misidentification
- http://daandehn.com - no more than 1 sentence over multiple files. Appears to be photo captions of NZ placenames
DE:
- http://etymologie.info/~e/n_/nz-___reg.html - placenames, not meaningful
!! https://www.cartogiraffe.com and http://www.cartogiraffe.com - some genuine pages (Rarotongan), but one page is in Czech that had a single word misidentified as MRI
~ http://svenkirsten.com/ - one page mentions "tiki" but the rest is in English. The other is an (English) caption of "Book of Tiki A Maori Maiden"
- herocity - autotranslated
- weltderberge.de - 3 pages mention NZ mountains by name.
~ (arts.mythologica.fr) https://mythologica.fr/oceanie/texte/pantheon_polynesien.pdf - mentions certain Maori Gods and other Polynesian Gods by name.
- https://traynews.com - nothing in MRI, misdetected
~ http://klaaskoehne.de/galleries/nzl-taranaki/index.html - mentions NZ mountain names
- http://www.nierstrasz.org/deGrauwRegister.rtf - misdetected European (Dutch) names as MRI
X https://afrikhepri.org/mi/ - autotranslated
- https://www.tvteile.de - pure German pages, misdetected "Automatik" as a Maori language word
- etoile-de-lune.net - 5 pages containing 1 sentence each but none with 2 sentences detected
- https://www.you-fly.com - misdetection of German "Warum?" as MRI
- http://vulkane.ch - misdetected pages on Hawaiian volcanoes.
- http://www.stephe.de - photos from NZ captioned with NZ placenames
- http://insecta.pro - misdetection
- http://m.distanta.1km.net - NZ placenames. Lots of distances mentioning Waitangi. Nothing detected as containing more than 1 sentence.
- https://ersatzteile-fachversand.de - German misdetected as Maori.
- https://laskar02cinta.page.tl/Info.htm - seems like a junk site with a random sentence autotranslated into many different languages. So one sentence possibly in Maori, but may not make sense.
- http://www.behlig.de - misdetection. Photos from Hawaii.
!! http://www.udhr.de - Universal Declaration of Human Rights. (Also on a Bulgarian site). Multiple translations available.
- ITALY:
   http://oipaz.net/IMG/GalleriaAotearoa/ - NZ photogallery with each photo captioned by placename
   http://www.marcosanti.it/Reportage/Oceania_ph/Nuova_Zelanda/ - each photo captioned by NZ placename
   http://www.pegasoesmicamion.com/ - REO abbreviation misidentified, also in REO%20PUBLICIDAD.htm
- AUSTRIA:
   petit-prince.at - Tahitian and Wayuu (Venezuela) translations of Le Petit Prince
   http://www.tmtmm.net/newzealand - photos from NZ named after places and people's names
- ROMANIA: parohiauceadesus.ro - Sentences of single Romanian words misidentified.
- ISRAEL:
   http://www.daat.ac.il - misidentification of "no." as MRI, and Hebrew words.
   https://www.hitiaotera.com/ - misidentification of Tahitian pages
- RUSSIA: https://www.gismeteo.lv - misidentification of an email address
- JAPAN: http://yutaka.it-n.jp - many pages of scientific names of (plants?) which are often misdetected as MRI
!! - IRELAND, ie: https://coggle.it
- IRAN: https://www.dideo.ir/v/yt/d6cgya0ze-E - video title from MaoriTelevision website
- CZECH REPUBLIC:
?  https://www.fipojobs.com/new-zealand/jobs-work-p-1 - NZ job position title in MRI but rest in English
!! http://www.henryklahola.nazory.cz/094.Maori.htm and http://henryklahola.nazory.cz variant
   http://about.ilikeyou.com - dating site. Misidentification.
- SPAIN:
!! https://www.uv.es/~pla/red.net/intmaori.html
   https://www.reclamaciondevuelos.com - 2 occurrences of the word "kiwi"
   http://www.info-hoteles.com/nz/2/hotels_lake_rotoiti.asp - 2 uses of the same placename
   http://www.cruceros-princess.mx/princessMX/Oferta_Cruzeiros_Polinesia.html - Polynesian placenames
- SINGAPORE: https://omg-solutions.com - autotranslated
- TURKEY: https://www.elitedeluxe.com.tr/mi/yatak-odasi-takimlari - autotranslated
- MEXICO: http://www.gelbukh.com - misidentification, lines of just numbers or phrases like "Area Chair" in English and Russian CVs.
- FINLAND: http://pertti.com - travelogue, placenames
- SWITZERLAND CH:
   nicoledidi.ch - blog, placenames
   https://photos.axelebert.org - Tahiti related content
- UNKNOWN: https://www.viveipcl.com: tours website, placenames mentioned
#- EU: https://www.the-good-stuff-factory.be/mi/ : Autotranslated
!! - BULGARIA: http://anitra.net/activism/humanrights/UDHR/rrt_print.htm (2 pages)


AUSTRALIA:
!! https://www.kiwiproperty.com - e.g. https://www.kiwiproperty.com/the-base/mi/he-paepaki/ has some actual MRI sentences. [Not autotranslated]
? http://fionajack.net - Wellington gallery of artist. A few occurrences of Kia Ora in a title-like context (e.g. "Street Party Kia Ora! Kia Ora!")
X!! https://infogram.com/te-marautanga-o-aotearoa-moe-pld-allocations-2012-1go502ygvn562jd - site of individual pages (like docs.google.com). This one has a relevant infogram image. But it's English with MRI in the image legend and captions.
!! https://koreromaori.com - some actual Maori language sentences
   http://theunderwaterworld.com/Galleries/Roimata/roimata-frame.html - placenames

UK:
   http://www.wordsearchfun.com/200628_Word_Find_wordsearch.html - 2 word games with Maori words (one of them has 3 different views, e.g. print view)
? https://omniatlas.com/maps/australasia/18400206/plain/ - historical map with Maori iwi names over NZ map regions
? https://omniatlas.com/maps/australasia/18400206/ - historical map of Australia and NZ at the time of the Treaty of Waitangi, with events marked in English
   https://centrallanguageschool.com - AUTOTRANSLATED
   https://www.solasolv.com - Autotranslated product site
   http://mikestephens.co.uk/ - photo captions containing NZ placenames
   http://www.woolrych.org/nzholiday2004/ - photogallery captioned with NZ placenames


US:
Done: manually inspected 68/117 sites

TOTAL US: 4+7+7+4+3=25

DEFINITELY:
+ http://anglicanhistory.org,
+ http://www.unicode.org, [Universal declaration of Human Rights]
+ https://static-promote.weebly.com,
+ http://aclhokiangarocks.blogspot.com, [often English, but COMMUNITY. At least short or partial MRI sentences.]

BIBLE/MOHAMMED/BAHAI TRANSLATIONS probably not auto translations:
+ http://bahaiprayers.net, [Dutch seems to be properly translated, not auto-translated, so maybe MRI too]
+ https://biblehub.com,
+ http://www.muhammad.com, [possibly not autotranslated]
+ http://www.godrules.net, [possibly not autotranslated]
+ http://m.biblepub.com,
+ http://www.krassotkin.ru, [probably real translations, as there are multiple Dutch translations from different sources provided]
+ http://www.gotquestions.org, [doesn't appear autotranslated]
X https://ebible.org, [Hiri Motu, PNG language misdetected. Doesn't seem to have Maori]
X https://www.bible.com, doesn't have Maori translation. Misdetected.
X https://wol.jw.org, - doesn't have Maori translations. Instead, Rongo-rongo, Kiribati (Micronesian) etc misdetected
X https://png.bible, [misdetected, Papua New Guinea]
X http://www.precious-testimonies.com, http://precious-testimonies.com/JesusDidItTranslations/JesusDidItMaoriTranslation.htm may be autotranslated as the Dutch page looks more like Danish or some Scandinavian language and the French page is missing accented characters.

CHECK, PROBABLY - PROCESSED:
!! https://maorinews.com,
!! http://maaori.com,
!!+ http://kiaorahola.blogspot.com,
+ https://kjohnsonnz.blogspot.com,
+ http://pumanawawhangara.blogspot.com,
+ http://dannykahei.tripod.com,
+ http://burkekm001.tripod.com,
+ http://tkkpipipaopao.blogspot.com,
+ http://manateina.blogspot.com,
? tkkpipipaopao.blogspot.com? http://rangiwewehi.com, [English, but community]
? https://www.terakau.org, [COMMUNITY, but English]
? https://www.pipirikiapapatuanuku.org, [COMMUNITY?, in English, environment site]
~ http://georgegi.tripod.com,
~ http://ngarangatahi.tripod.com, [1 page, image caption, Maori language warden position title with English sentence for appointment as warden]
X http://fhr.kiwicelts.com,
X http://tkrow.tripod.com, [English, background of NZ place]
X http://www.mkiwi.com, - placenames
X http://www.waimate.com, [English, NZ place]

MAYBE, INSPECT - PROCESSED:
? https://www.natekore2018.com, [lots of English, but COMMUNITY, CULTURE]
+ http://tatai09.blogspot.com,
+ http://www.twttoa.com,
+ http://tuhua2010.blogspot.com,
X http://www.huapala.org, [misdetected, Hawaiian]
X https://www.vaihaunui.net, [misdetected, Tahiti]
X https://www.kaifineart.com, [art site by different artists. A Chinese and another (possibly Japanese) name were misdetected]
X http://mahoraroom8.blogspot.com, [NZ school, but main page mostly in English. No pages with > 1 sentence detected as MRI]
+ http://piripi.blogspot.com,
X http://www.hiroa.pf, [misdetected. Crawled content appears Polynesian not Maori]
X http://korora.econ.yale.edu, [NZ place photo caption]
X https://www.poehalisnami.ua, [mostly Cyrillic, with some NZ or Polynesian names misdetected]
X http://hannas-reiseblog.blogspot.com - one page contained NZ placenames, another had a word misdetected


+ https://www.breaker.audio, [audio, with occasional English.]
? https://livestream.com, [video and audio, seems in English, but maybe CULTURAL/COMMUNITY?]

X https://docs.google.com, timetable with occasional Maori language word
+ https://drive.google.com, https://drive.google.com/file/d/1NwuzafjddaP8gxI7O_Zapts5bM7mrtwn/preview is an image of Maori number names. But the other page on drive.google.com is a NZ certificate or ID (in English) of a person's position.
http://ritusehji.blogspot.com - no page with more than 1 sentence detected. But short string of actual MRI content. Educator blog with pictures and English language content.


PINTEREST
+ https://in.pinterest.com/pin/317363104978423418/
   "karakia mo te moana - Google Search | Te Reo Maori Resources | Moana, Powerpoint tips, Google"
? https://za.pinterest.com/pin/524669425310419500/
   Maori Moko | Image | Moko Maori Tattoo & Portraits | TA MOKO | Maori tribe, Maori people, Maori art [COMMUNITY, CULTURE]
[The other Pinterest sites detected as numPagesContainingMRI > 0 were misdetected]

https://nl.pinterest.com,
https://www.pinterest.jp,
https://www.pinterest.it,
https://www.pinterest.co.uk,
https://www.pinterest.ca,
https://za.pinterest.com,
https://www.pinterest.fr,
https://in.pinterest.com,

MORE BLOGSPOTS
X http://word-dialect.blogspot.com, [Indonesian, misdetected]
~ http://atopeconlostopes.blogspot.com, [title on page appears to be in MRI, but content appears to be in English and South/Central American. Internationally focussed content.]
X http://lianzaconference2012.blogspot.com, [NZ placename or institution]
? http://mrshamiltonskoolkidz.blogspot.com, [te reo Maori related school activities. Described in English.]
X http://capsuraotearoa.blogspot.com, [blog in French, photo captions contain NZ placenames]
X http://blogdepasopor.blogspot.com, [blog in French, Rapa Nui/Easter Island related content, misdetected.]


UNLIKELY
?? http://naturalfatburner.net, http://naturalfatburner.net/NoNonsenseTed/fatloss-mao/ feels like it's autotranslated, an image of text appears, but the text is in MRI [advertising for some weight loss gimmick]


BLACKLIST:
X http://ww25.milfsplease.com,
X http://www.the-naked.com

OTHER:
X http://seapixonline.com, https://www.seapixonline.com, [photo captions of ships. Sometimes misdetected Japanese words as MRI.]
X http://www.code-postal.com, https://www.code-postal.com, [not more than 1 sentence detected as in MRI]
X https://www.dbnames.net, [Name database, lots misdetected]

STILL TO DO LIST - PROCESSED:

X https://www.myadsclassified.com, [misdetected 3 short English sentences as MRI]
X http://www.whoisthatr.com, [misdetected short English sentence as MRI]
X https://www.oemsec.com, [autotranslated product site]
X http://svenskadress.net, [linkfarm-like site of related junk links, contained URLs misdetected as MRI]

X https://www.webwiki.com, [contains URLs. URLs containing Aotearoa as substring detected as MRI. But no proper sentence content.]
X http://mikebonnice.com, [Hawaiian and Tahiti related content misdetected]
X http://www.hudl.com, [misdetected short English sentence as MRI]
X http://www.wikitree.com, [misdetected short English sentence as MRI]
X http://shuttersportnelson.photoshelter.com, [image captions of "Wairua Warrior"]

X http://niken8media.logdown.com, [Poker website? Looks autotranslated or Lorem Ipsum type of meaningless sentences.]
X https://www.podrozeady.com, Looks Polish or other East-European language. The NZ page https://www.podrozeady.com/NZ/4/ had placenames detected.

X http://www.thesalmons.org, [detection and misdetection of author names of papers hosted]

X http://linkvip.top, [.rar and media file links misdetected as MRI]


X http://www.lunar-occultations.com, [NZ place names for astronomical phenomena]
X http://shangrilapress.net, [NZ placenames]
X http://malecek.com, [misdetection CD title]
X https://www.blue-frontiers.com, [Tahitian, Reo Tahiti, misdetected as MRI]
X http://www.whoisentry.com, [URL names, looked at several which were probably misdetected as MRI]
X http://loquevendra318.com, [uses Google translate for auto-translation]


?? http://www.forensicfashion.com, [historical information, useful for CULTURE? e.g. http://www.forensicfashion.com/1807MaoriChief.html]

X http://www.eyecontactsite.com, [Lots of names. And a few short sentences or words possibly in comments.]
X http://eartheum.com, [Rapa Nui, Easter Island related content. Misdetected]
X http://www.steve-wheeler.co.uk, [Blogspot. Title of a single page is in Maori. "Aotearoa ... kei te aroha au ki a koe"]
X https://chromium.googlesource.com, [some source code related to languages' two letter codes]

X http://www.roadsmile.com, [Lots of misdetection based on word Kia.]
?? https://www.knowatom.com, https://phet.colorado.edu [Similar looking science web sites for children. Uses auto-translation?]

X https://www.indexmundi.com, [place names. Pages about Solomon Islands. Misdetection of placenames.]



X http://wowwars.net, [Has a page on Kia Kaha meaning, but URL redirects to a different low quality site with bad formatting and adverts.]
?? https://www.hidroponia.org.mx, [Not sure if https://www.hidroponia.org.mx/index.php/idiomas/284-hydroponics-te-ahurea-wai-maori is autotranslated or not. Can't easily locate existence of Dutch or German translated pages. There's Tamil-Singapore, but no other Tamil. So maybe translations based on target buyer audience?]
X http://www.v3whois.com, [URLs are misdetected as MRI]
X http://rhymebrain.com, [appears to have misdetected a short phrase of 2 words, Kai Kaia, besides phrase words from other languages]


X SINGLE SENTENCE DETECTED (NO MORE AND NOT WHOLE PAGE isMRI):
   http://frontrowphotos.com,
   http://www.pressreader.com,
   https://www.nccri.ie,
   http://takethatvacation.com,
   http://worldradiomap.com,
   http://www.namesdir.com,

   X http://www.frogsonline.com, [NZ hotels, placenames]
   X http://www.geni.com, [Single sentence misdetection]
   X http://wikiedit.org, [just a list of lots of words, possibly placenames. Some misdetected, e.g. Rapa Nui]



TOTALS:
US: 25
AU: 2
DE: 2
DK: 2
BG: 1
CZ: 1
ES: 1
FR: 1
IE: 1
TOTAL: 213

788------------------------------------------------
7892. Need to inspect all those sites with any webPAGE that has mi in its URL path (mi.* or */mi) that are not sites with nz TLD or originating in NZ:
790
db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}, {domain: {$not: /\.nz$/}}, {geoLocationCountryCode: {$ne: "NZ"}}]}).count()
= 472

(vs:
db.getCollection('Websites').find({$and: [{numPagesInMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}, {domain: {$not: /\.nz$/}}, {geoLocationCountryCode: {$ne: "NZ"}}]}).count()
= 209)
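
The two counts above differ only in which page counter is tested. A minimal sketch of a helper that builds the shared filter, so the two queries stay in step. `overseasMiFilter` is a hypothetical name and not part of the original scripts; the field names are those used in the queries above, and the regex escapes the dot so that only a literal ".nz" suffix is excluded.

```javascript
// Hypothetical helper (not from the original notes): builds the filter for
// overseas sites that have "mi" in a URL path.
// counterField is expected to be "numPagesContainingMRI" or "numPagesInMRI".
function overseasMiFilter(counterField) {
  return {
    $and: [
      { [counterField]: { $gt: 0 } },
      { urlContainsLangCodeInPath: true },
      { domain: { $not: /\.nz$/ } },            // literal ".nz" suffix only
      { geoLocationCountryCode: { $ne: "NZ" } }
    ]
  };
}

// In the mongo shell this could be used as, for example:
//   db.getCollection('Websites').find(overseasMiFilter("numPagesContainingMRI")).count()
//   db.getCollection('Websites').find(overseasMiFilter("numPagesInMRI")).count()
```

Building both filters from one function is only a convenience for re-running the comparison; it does not change what is counted.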
797
798
db.Websites.aggregate([
    {
        $match: {
            $and: [{numPagesContainingMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}, {domain: {$not: /\.nz$/}}, {geoLocationCountryCode: {$ne: "NZ"}}]
        }
    },
    {$group: {_id: "$geoLocationCountryCode", count: {$sum: 1}, domain: { $addToSet: '$domain' }}},
    { $sort : { count : -1} }
])
808
Also excluding AU, since we dealt with that already in step A1:
 db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}, {domain: {$not: /\.nz$/}}, {geoLocationCountryCode: {$not: /NZ|AU/}}]}).count()
= 471
812
db.Websites.aggregate([
    {
        $match: {
            $and: [{numPagesContainingMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}, {domain: {$not: /\.nz$/}}, {geoLocationCountryCode: {$not: /NZ|AU/}}]
        }
    },
    {$group: {_id: "$geoLocationCountryCode", count: {$sum: 1}, domain: { $addToSet: '$domain' }}},
    { $sort : { count : -1} }
])
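
The country exclusion above uses a regex, which matches NZ or AU anywhere inside the country code string. A sketch of the same pipeline with an exact-match list via `$nin` instead; `perCountryPipeline` is a hypothetical name, and for the two-letter codes and "UNKNOWN" seen in the results it should select the same documents, though that is an assumption and has not been re-run against the data.

```javascript
// Hypothetical helper (not from the original notes): per-country grouping of
// overseas sites with "mi" in a URL path, excluding an exact list of countries.
function perCountryPipeline(excludedCountries) {
  return [
    {
      $match: {
        $and: [
          { numPagesContainingMRI: { $gt: 0 } },
          { urlContainsLangCodeInPath: true },
          { domain: { $not: /\.nz$/ } },
          // exact matches only, unlike the substring regex /NZ|AU/
          { geoLocationCountryCode: { $nin: excludedCountries } }
        ]
      }
    },
    { $group: { _id: "$geoLocationCountryCode", count: { $sum: 1 }, domain: { $addToSet: "$domain" } } },
    { $sort: { count: -1 } }
  ];
}

// In the mongo shell, for example:
//   db.Websites.aggregate(perCountryPipeline(["NZ", "AU"]))
```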
822
823
824 /* 1 */
825 {
826 "_id" : "US",
827 "count" : 305.0,
828 "domain" : [
829 "http://www.qjfiberglass.com",
830 "http://www.sokenswitch.com",
831 "https://follow3rs.com",
832 "http://www.zjnbzy.com",
833 "http://www.nbbvc.com",
834 "https://mamaclub.info",
835 "https://www.nbwinwinea.com",
836 "https://www.sdspraybooth.com",
837 "https://www.jlextract.com",
838 "http://www.artmetalcn.com",
839 "http://www.steel-in-china.com",
840 "http://www.sunnymaycn.com",
841 "https://www.weld-automation.com",
842 "http://www.homewin88.com",
843 "http://indigenousblogs.com",
844 "https://biblia.gospelprime.com.br",
845 "http://www.jhc-nonwoven.com",
846 "http://www.forever-moving.com",
847 "http://www.eternal-friendship.com",
848 "http://www.conele-mixer.com",
849 "http://binaryoptionsindicators.com",
850 "http://www.mao-shuo.com",
851 "http://www.bmaxmachine.com",
852 "http://www.sdxhhd.com",
853 "http://lingeriefc.com",
854 "http://infomutt.com",
855 "https://jobdescriptionsample.org",
856 "http://mi.hongwugas.com",
857 "http://www.lishin.cc",
858 "http://www.mksmartcard.com",
859 "http://www.toption-ingredients.com",
860 "http://www.sps-squeegee.com",
861 "https://facebook.roseconverter.com",
862 "https://www.bestpvcfence.com",
863 "http://www.tubemillcn.com",
864 "http://www.jindunlaobao.com",
865 "http://www.wavesspring.com",
866 "http://www.restart-industry.com",
867 "https://mi.nyecountdown.com",
868 "http://www.ycautoc.com",
869 "http://www.htwindsolarpower.com",
870 "http://www.joyseaplywood.com",
871 "http://www.teda-hydraulic.com",
872 "http://www.gmk-valve.com",
873 "https://usahello.org",
874 "https://www.datemypet.com",
875 "https://worldstarhiphop.roseconverter.com",
876 "http://www.sdtzgloves.com",
877 "https://www.airpullfilter.com",
878 "http://www.bestwaytowhitenteethguide.org",
879 "https://mi.m.wikipedia.org",
880 "http://www.analiabriz.com",
881 "http://www.xida-electronics.com",
882 "http://www.weldpipemill.com",
883 "http://www.ictctruss.com",
884 "https://www.junschem.com",
885 "https://www.judinwire.net",
886 "http://www.mtpak.com",
887 "http://www.nantaidiesel.com",
888 "https://www.nbkeming.com",
889 "http://www.windsolarchina.com",
890 "http://www.gormeet.com",
891 "http://www.wf-fastener.com",
892 "http://www.pressurelantern.com",
893 "https://www.drickinstruments.com",
894 "http://milfsplease.com",
895 "http://www.risepipe.com",
896 "https://www.yourcloudlibrary.com",
897 "http://portal.smart-project.info",
898 "http://cdn.centrallanguageschool.com",
899 "http://www.shengxinsport.com",
900 "https://www.tianjia-lock.com",
901 "http://www.julongjewelry.cn",
902 "https://www.yogemcasting.com",
903 "http://www.chinaocan.com",
904 "http://www.autosunsoul.com",
905 "https://www.prostepper.com",
906 "https://www.pldyes.com",
907 "http://www.nicerelay.com",
908 "https://www.sinodryair.com",
909 "https://www.risenltd.com",
910 "http://www.albertnovosino.com",
911 "http://www.ttyzfilter.com",
912 "http://www.bst-elecs.com",
913 "http://www.hqftex.com",
914 "http://www.kehengmixing.com",
915 "http://loginmail.online",
916 "http://www.bluekin.com",
917 "https://blondewebcamgirl.com",
918 "https://www.nickel-alloy.net",
919 "https://www.hjfoodmachinery.com",
920 "http://csunplugged.org",
921 "https://www.inpnurseryproducts.com",
922 "http://www.americasportsfloor.com",
923 "http://www.dmdryer.com",
924 "http://www.sindadisplay.com",
925 "http://www.focusway-casting.com",
926 "https://mi.centr-zashity.ru",
927 "https://www.axnewdisplay.com",
928 "https://cycletraderpro.com",
929 "https://vk.roseconverter.com",
930 "https://www.livehoster.com",
931 "http://www.cnsongben.com",
932 "http://www.mytrickstips.com",
933 "http://www.quickcncmachine.com",
934 "http://www.arjextrailerparts.com",
935 "http://www.shshenyong.com",
936 "http://www.pvcroofingtile.com",
937 "http://www.wrdtubemill.com",
938 "http://church-of-christ.org",
939 "https://www.td-casting.com",
940 "https://www.hengweihoseclamp.com",
941 "http://www.wpcline.com",
942 "https://www.kubbamachine.com",
943 "http://www.goethe.de",
944 "https://www.njkeyuda.com",
945 "http://www.prostepper.com",
946 "http://www.cnfeinade.com",
947 "http://www.huamachinery.com",
948 "http://www.damiser.com",
949 "http://www.shanghailangzhiweld.com",
950 "http://www.fanhaopets.com",
951 "https://blockchains.io",
952 "http://www.inpnurseryproducts.com",
953 "http://www.yixinhetrade.com",
954 "http://www.newbaoquan.com",
955 "https://mi.lawyers.cafe",
956 "http://www.shenhe-bearing.com",
957 "http://atoall.com",
958 "http://www.vango-tech.com",
959 "https://www.gigalight.com",
960 "http://www.ladybagcn.com",
961 "http://www.tjcywires.com",
962 "http://www.vigor-industry.com",
963 "http://www.litbright-candles.com",
964 "http://www.nide-industry.com",
965 "http://www.cnfreda.com",
966 "http://www.jbpcba.com.cn",
967 "http://www.qitai-adhesive.com",
968 "http://www.weld-automation.com",
969 "http://www.cnyaonan.com",
970 "http://www.ruifeng-leather.com",
971 "http://www-hotmail-com.email",
972 "http://www.jointcontrols.net",
973 "https://twitter.roseconverter.com",
974 "https://www.aquagem.com.cn",
975 "http://www.seasum.cn",
976 "http://www.steelprotectionpack.com",
977 "http://www.suoxuehuwai.com",
978 "http://www.sunshinebelt.com",
979 "http://www.nyforgedwheels.com",
980 "http://www.amcbox.com",
981 "http://www.livepro-beauty.com",
982 "http://www.nbyobo.com",
983 "http://www.chinacarbonfibre.com",
984 "https://guidebooq.com",
985 "https://www.hello4x4.com",
986 "http://www.zhonghe222.com",
987 "http://www.church-of-christ.org",
988 "https://www.czzhit.com",
989 "https://www.king-pcb.com",
990 "http://www.secondhormone.com",
991 "http://www.sxceramic.com",
992 "http://www.hobbycarbon.com",
993 "http://www.bdknitting.com",
994 "http://www.ntvigourbrush.com",
995 "http://www.china-brewhouse.com",
996 "http://mi.tccasdic.com",
997 "http://www.hzhinew.com",
998 "http://www.silicone-odm.com",
999 "http://www.liweimetal.com",
1000 "http://www.huaxinfurnace.com",
1001 "http://www.envicool.net",
1002 "http://www.cnxh-electric.com",
1003 "http://www.jiejingfactory.com",
1004 "http://www.longda-inc.com",
1005 "http://www.pamaens.com",
1006 "http://www.sdcncrouter.com",
1007 "http://www.tkfanen.com",
1008 "http://www.touchdisplays-tech.com",
1009 "http://www.twtvalvecn.com",
1010 "http://www.weddingfurniture.com",
1011 "https://www.huadongmedical.com",
1012 "http://www.ledecofr.com",
1013 "http://www.rosin-kings.com",
1014 "http://www.aluminum-profiles-supplier.com",
1015 "http://www.cannapresso.com",
1016 "https://www.cz-juteng.com",
1017 "http://www.strongsaw.com",
1018 "http://jobdescriptionsample.org",
1019 "http://www.btmeac.com",
1020 "http://www.nicehut-window.com",
1021 "http://www.accotech.net",
1022 "https://www.dshprecision.com",
1023 "http://www.gemnice.com",
1024 "http://www.richina-tools.com",
1025 "http://www.brushcutterjusen.com",
1026 "http://www.szhaiwang.com",
1027 "https://www.conele-mixer.com",
1028 "https://www.tkthvac.com",
1029 "http://technobuzzer.com",
1030 "https://www.csunplugged.org",
1031 "http://www.ainuogas.com",
1032 "https://policies.oclc.org",
1033 "http://www.xfinsulation.com",
1034 "http://www.lanlinprintech.com",
1035 "http://www.yrkseal.com",
1036 "http://www.jpslurrypump.com",
1037 "http://www.soontruepackaging.com",
1038 "http://www.shengrunqiche.com",
1039 "http://www.luluae.com",
1040 "https://www.judipak.com",
1041 "http://www.cz-juteng.com",
1042 "http://www.jiajiebathmirror.com",
1043 "http://www.bigrollscloth.com",
1044 "http://www.chinatopcnc.com",
1045 "https://drugsinc.eu",
1046 "http://www.wosaicabinet.com",
1047 "http://www.wellfit-sportswear.com",
1048 "http://www.pxbaisheng.com",
1049 "http://www.meihua-wm.com",
1050 "http://www.wzdongyi.com",
1051 "http://www.kd-physicalrehab.com",
1052 "http://www.longs-motor.com",
1053 "https://www.samsungwiremesh.com",
1054 "http://www.wellformpacking.com",
1055 "http://www.hs-stationery.com",
1056 "http://www.allutertech.com",
1057 "http://www.czzhit.com",
1058 "http://www.jlgrating.com",
1059 "http://www.qbd-group.com",
1060 "http://www.evaescort.net",
1061 "https://dwsolo.com",
1062 "http://www.chuamotor.com",
1063 "http://www.ksdoing.com",
1064 "http://mi.broadcastbeat.com",
1065 "http://www.czldfloor.com",
1066 "http://www.qypaperbox.com",
1067 "https://mi.wikipedia.org",
1068 "http://www.houshenshoes.com",
1069 "http://www.xzc9.com",
1070 "http://www.chinacombinerbox.com",
1071 "https://www.everfineplastics.com",
1072 "http://www.sinemagnetic.com",
1073 "http://www.linphos.com",
1074 "https://www.rikoooo.com",
1075 "http://www.ncpcpharma.com",
1076 "http://www.evergrowingcage.com",
1077 "http://www.qxmic.com",
1078 "https://www.fxcc.com",
1079 "http://www.ldsolarpv.com",
1080 "http://mytrickstips.com",
1081 "http://www.linbaymachinery.com",
1082 "http://www.photoprofix.com",
1083 "http://www.supplyfurniture.com",
1084 "http://www.honglu-mining.com",
1085 "http://www.szebo.com",
1086 "http://www.cnrgxy.com",
1087 "http://blicanada.net",
1088 "http://www.homey-tec.com",
1089 "http://www.whties.com",
1090 "http://www.zhenchengscrew.com",
1091 "http://www.ruk-tech.com",
1092 "http://www.longxin-global.com",
1093 "https://www.tymexnetting.com",
1094 "http://www.chinabosun.com",
1095 "http://www.b-packaging.com",
1096 "http://www.ncpcvet.com",
1097 "https://mi.kidspicturedictionary.com",
1098 "http://mi.guoguangelectric.com",
1099 "http://topbitcoincard.com",
1100 "https://atoall.com",
1101 "http://www.acouplefortheroad.com",
1102 "http://www.tongyujiaju.com",
1103 "http://www.chinapipemills.com",
1104 "http://www.infomutt.com",
1105 "http://www.fxctool.com",
1106 "http://www.samewe.net",
1107 "https://www.aquark.com.cn",
1108 "https://www.artiegarden.com",
1109 "http://www.fxpremiere.com",
1110 "http://www.sog-pump.com",
1111 "http://www.omnicnc.com",
1112 "https://www.waterproof-factory.com",
1113 "http://www.wanmaroto.com",
1114 "http://mi.gmpmetalwork.com",
1115 "https://www.webhostingsecretrevealed.net",
1116 "http://www.gecko-kalimba.com",
1117 "https://www.glorystarlaser.com",
1118 "http://www.viairdoormat.com",
1119 "https://vimeo.roseconverter.com",
1120 "https://www.fctele.com",
1121 "http://www.hzzjair.com",
1122 "https://2fish.co",
1123 "http://www.qymachines.com",
1124 "http://www.chinachairtable.com",
1125 "http://www.gfh-electric.com",
1126 "http://www.tangres100.com",
1127 "https://www.valve-pipe-fitting.com",
1128 "http://www.fancyco.com",
1129 "http://www.zhengmaoelec.com",
1130 "http://www.chinagxmy.com",
1131 "https://www.tjshenzhoutong.com",
1132 "https://maxspeedtest.com"
1133 ]
1134 }
1135
1136 /* 2 */
1137 {
1138 "_id" : "CN",
1139 "count" : 113.0,
1140 "domain" : [
1141 "https://www.fibereye2.com",
1142 "https://www.outstandingdm.com",
1143 "https://www.szradiant.com",
1144 "http://www.gmmdjx.com",
1145 "http://www.likvchina.com",
1146 "https://www.abdindustrial.com",
1147 "https://www.c-superun.com",
1148 "https://www.slagremoving.com",
1149 "https://www.sino-masterbatch.com",
1150 "http://www.cntiescarf.com",
1151 "https://www.dm-compressor.com",
1152 "https://www.szhtpmart.com",
1153 "https://www.phhydraulic.com",
1154 "https://www.imposalight.com",
1155 "https://www.medke.com",
1156 "http://www.eburn-burner.com",
1157 "https://www.haitungchem.com",
1158 "http://www.medicohongkong.com",
1159 "http://www.koowheel.com",
1160 "https://www.aerial-display.com",
1161 "https://www.cntfsolar.com",
1162 "https://www.aoxinhvacr.com",
1163 "https://www.diamante-tech.com",
1164 "https://www.richest-group.com",
1165 "http://www.world-starter.com",
1166 "http://www.goldenlaser.cc",
1167 "https://www.km-medicine.com",
1168 "https://www.safesworld.com",
1169 "https://www.peptidejymed.com",
1170 "https://www.nbhengchen.com",
1171 "https://www.xinyuesteel.com",
1172 "https://www.charmingmetal.com",
1173 "https://www.lasonparts.com",
1174 "https://www.ngyc.com",
1175 "https://www.pacopower.com",
1176 "https://www.tjtgsteel.com",
1177 "http://www.abdindustrial.com",
1178 "https://www.yangrutingtrade.com",
1179 "http://www.wedacdisplays.com",
1180 "https://www.gaofeng-petro.com",
1181 "https://www.ez-walk.com",
1182 "https://www.szzhsbag.com",
1183 "https://www.simphoenix.com",
1184 "http://www.focuslasersystems.com",
1185 "https://www.fc-med.com",
1186 "http://www.zypackag.com",
1187 "http://www.kavounautoparts.com",
1188 "https://www.foocles.com",
1189 "https://www.jsjlmachinery.com",
1190 "https://www.special-metal.com",
1191 "https://www.bestardoors.com",
1192 "http://www.wenwencf.com",
1193 "https://www.insharevape.com",
1194 "https://www.dghk-buffer.com",
1195 "https://www.n2o2gas.com",
1196 "https://www.changjia-machinery.com",
1197 "https://www.nfyo.com",
1198 "http://www.estarspareparts.com",
1199 "https://www.jsbotanics.com",
1200 "https://www.chinarfidcard.com",
1201 "https://www.sjzhgw.com",
1202 "https://www.study-mandarin.com",
1203 "https://www.qdruidetai.com",
1204 "https://www.zhongxinlighting.com",
1205 "http://www.qjqdvalve.com",
1206 "https://www.painting-machine.com",
1207 "https://www.bescatray.com",
1208 "https://www.tianseoffice.com",
1209 "https://www.herbal-ingredients.com",
1210 "https://www.qlart.com",
1211 "https://www.sehenda-en.com",
1212 "https://www.egbadges.com",
1213 "http://www.eudemonbaby.com",
1214 "http://www.3drambery.com",
1215 "https://www.chinawelken.com",
1216 "http://www.jsbotanics.com",
1217 "https://www.rswires.com",
1218 "https://www.zjyongqi.com",
1219 "https://www.micropreparedslides.com",
1220 "http://www.longtopmining.com",
1221 "https://www.rykay.com",
1222 "https://www.sdtoplit.com",
1223 "https://www.wecare-life.com",
1224 "http://www.wigglewires.com",
1225 "https://www.grandstarcn.com",
1226 "https://www.bailixin.com",
1227 "http://www.refinehotelsupply.com",
1228 "http://www.prius-automatic.com",
1229 "https://www.nbulboy.com",
1230 "https://www.jy-glass.com",
1231 "http://www.ankaicnc.com",
1232 "https://www.band-ss.com",
1233 "https://www.hytokstech.com",
1234 "https://www.goldnard.com",
1235 "http://www.comfortebicycle.com",
1236 "https://www.zengrit.com",
1237 "https://www.3drambery.com",
1238 "https://www.pakite.com",
1239 "https://www.xianglin-plastics.com",
1240 "https://www.inductorchina.com",
1241 "https://www.nbjiatong.com",
1242 "https://www.bofanpc.com",
1243 "https://www.sakysteel.com",
1244 "http://www.coneleqd.com",
1245 "https://www.jewellrylove.com",
1246 "http://www.nbwellrun.com",
1247 "http://www.yulong-cellulose-cmc.com",
1248 "https://www.aootan.com",
1249 "https://www.coffbrewing.com",
1250 "http://www.jetwayamenities.com",
1251 "https://english.taiergroup.com",
1252 "http://www.czhengfa.com",
1253 "https://www.sitzonechair.com"
1254 ]
1255 }
1256
1257 /* 3 */
1258 {
1259 "_id" : "FR",
1260 "count" : 19.0,
1261 "domain" : [
1262 "https://www.slotsltd.com",
1263 "https://mi.apicmo.com",
1264 "http://mi.psychicbonus.com",
1265 "http://mi.aasraw.com",
1266 "https://mi.hyperbaric-chamber.com",
1267 "https://mi.usa-casino-online.com",
1268 "https://mi.gem.agency",
1269 "https://mi.hghphuket.com",
1270 "https://mi.mehmetdursun.av.tr",
1271 "https://mi.mhthread.com",
1272 "https://mi.phcoker.com",
1273 "https://www.casino.uk.com",
1274 "https://www.planetkeyboard.com",
1275 "http://mi.outboard-boat-motor-repair.com",
1276 "http://www.gpedia.com",
1277 "http://mi.fitnessrebates.com",
1278 "https://www.expresscasino.com",
1279 "https://mi.petrpikora.com",
1280 "https://mi.isearch.de"
1281 ]
1282 }
1283
1284 /* 4 */
1285 {
1286 "_id" : "DE",
1287 "count" : 8.0,
1288 "domain" : [
1289 "https://herocity.de",
1290 "https://traynews.com",
1291 "http://www.almancax.com",
1292 "https://transposh.org",
1293 "http://transposh.org",
1294 "https://mi.vessoft.com",
1295 "https://www.saper-link-news.com",
1296 "https://afrikhepri.org"
1297 ]
1298 }
1299
1300 /* 5 */
1301 {
1302 "_id" : "NL",
1303 "count" : 6.0,
1304 "domain" : [
1305 "http://www.martinvrijland.nl",
1306 "https://realtytenerife.com",
1307 "https://www.bitbybitbook.com",
1308 "https://www.emergency-live.com",
1309 "http://www.cbdolievoordelen.nl",
1310 "http://www.spectrumschool.be"
1311 ]
1312 }
1313
1314 /* 6 */
1315 {
1316 "_id" : "CA",
1317 "count" : 5.0,
1318 "domain" : [
1319 "https://www.wikiplanet.click",
1320 "https://cloudsfeed.com",
1321 "http://newsrule.com",
1322 "http://dehaut.com",
1323 "https://www.chinanbdb.com"
1324 ]
1325 }
1326
1327 /* 7 */
1328 {
1329 "_id" : "HK",
1330 "count" : 2.0,
1331 "domain" : [
1332 "https://www.desunpump.com",
1333 "http://www.10turntables.com"
1334 ]
1335 }
1336
1337 /* 8 */
1338 {
1339 "_id" : "UA",
1340 "count" : 2.0,
1341 "domain" : [
1342 "http://ukraine.admission.center",
1343 "http://umsa.admission.center"
1344 ]
1345 }
1346
1347 /* 9 */
1348 {
1349 "_id" : "GB",
1350 "count" : 2.0,
1351 "domain" : [
1352 "https://www.centrallanguageschool.com",
1353 "https://www.solasolv.com"
1354 ]
1355 }
1356
1357 /* 10 */
1358 {
1359 "_id" : "UNKNOWN",
1360 "count" : 2.0,
1361 "domain" : [
1362 "https://mi.buyaas.com",
1363 "http://en.wiki.wintoflash.com"
1364 ]
1365 }
1366
1367 /* 11 */
1368 {
1369 "_id" : "ES",
1370 "count" : 1.0,
1371 "domain" : [
1372 "https://www.torresbus.es"
1373 ]
1374 }
1375
1376 /* 12 */
1377 {
1378 "_id" : "IE",
1379 "count" : 1.0,
1380 "domain" : [
1381 "http://netkiosk.co.uk"
1382 ]
1383 }
1384
1385 /* 13 */
1386 {
1387 "_id" : "RU",
1388 "count" : 1.0,
1389 "domain" : [
1390 "http://www.treningmozga.com"
1391 ]
1392 }
1393
1394 /* 14 */
1395 {
1396 "_id" : "SG",
1397 "count" : 1.0,
1398 "domain" : [
1399 "https://omg-solutions.com"
1400 ]
1401 }
1402
1403 /* 15 */
1404 {
1405 "_id" : "JP",
1406 "count" : 1.0,
1407 "domain" : [
1408 "https://forexmania.org"
1409 ]
1410 }
1411
1412 /* 16 */
1413 {
1414 "_id" : "EU",
1415 "count" : 1.0,
1416 "domain" : [
1417 "http://www.the-good-stuff-factory.be"
1418 ]
1419 }
1420
1421 /* 17 */
1422 {
1423 "_id" : "TR",
1424 "count" : 1.0,
1425 "domain" : [
1426 "https://www.elitedeluxe.com.tr"
1427 ]
1428 }
1429
1430
1431First, I eyeballed and excluded all obvious product sites which are automatically translated.
1432
1433Of interest or possible interest remain the following, grouped per country of site origin:
1434
1435US:
1436!! http://indigenousblogs.com [15/18 blogs work] - has one page in Maori (http://indigenousblogs.com/feeds/mi.xml)
1437X https://biblia.gospelprime.com.br - misdetection (containsMRI)
1438X ?https://follow3rs.com - seems dodgy and possibly auto-translated. Can't spell account, misspelled as accout
1439!! https://mi.m.wikipedia.org, https://mi.wikipedia.org
1440X https://usahello.org - autotranslated
X http://church-of-christ.org, http://www.church-of-christ.org - I think autotranslated, because the Dutch page https://church-of-christ.org/nl/ has "HET kerken van Christus" instead of the plural article DE
1442X https://www.livehoster.com
1443X http://www.americasportsfloor.com, - product store. Misdetected
1444!! http://csunplugged.org, https://www.csunplugged.org - University of Canterbury NZ and site only available in EN, MI, DE, ES, CN
1445X https://mi.lawyers.cafe - autotranslated
1446 X https://mi.centr-zashity.ru - same as lawyers.cafe above: autotranslated
1447! https://policies.oclc.org - not completely translated. Copyright page, privacy statement and cookie statement pages appear to be in Maori. Not sure if autotranslated since other pages aren't available in MI. Dutch equivalent pages seem human translated.
1448X http://jobdescriptionsample.org - autotranslated
1449X http://mi.broadcastbeat.com - autotranslated product site
1450X http://www.samewe.net - autotranslated product site
1451X https://mi.kidspicturedictionary.com - autotranslated, but MAY BE USEFUL
1452X https://www.rikoooo.com - autotranslated
1453
1454CN: -
1455
1456FR:
1457? https://mi.phcoker.com - product site "Shangke Chemical Rapu + 86 (1812) 4514114 [email protected]"
1458X http://www.gpedia.com - dodgy copy of wikipedia, see http://www.gpedia.com/nl/gpedia/Hoofdpagina
1459
1460NL:
1461X http://www.martinvrijland.nl - wordpress, autotranslated
1462
1463CA:
1464X https://www.wikiplanet.click (seems like a dodgy copy of wikipedia)
1465X cloudsfeed.com - wordpress admin page
1466
1467
1468db.getCollection('Webpages').find({$and: [{isMRI: true}, {URL: /indigenousblogs\.com/}]})
1469=> http://indigenousblogs.com/mi/
1470
1471
1472TOTAL: Only 4 sites contain genuine MRI sentences that aren't automatically translated out of all non-NZ/non-AU sites that have "mi" in a webpage's URL path.
1473
1474
TOTALS:
US: 25+4 from US with mi in URL path = 29
AU: 2
DE: 2
DK: 2
BG: 1
CZ: 1
ES: 1
FR: 1
IE: 1
TOTAL: 213+4 from US with mi in URL path = 217
1486------------------------------------------------
1487B. NEW ZEALAND SITES: NZ origin + .nz TLD SITES
1488------------------------------------------------
14891. Get NZ sites numPagesContainingMRI > 0
1490
1491// To list domains in alphabetical order, which addToSet doesn't do, see
1492// https://stackoverflow.com/questions/21967233/sorting-aggregation-addtoset-result
1493
db.Websites.aggregate([
    {
        $match: {
            $and: [
                {numPagesContainingMRI: {$gt: 0}},
                {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]}
            ]
        }
    },
    { $unwind: "$geoLocationCountryCode" },
    {
        $group: {
            _id: "nz",
            count: { $sum: 1 },
            domain: {$push: "$basicDomain" }, /* alternative: domain: { $addToSet: '$domain' }, */
            /* the two sums below appear in the output that follows */
            numPagesInMRICount: { $sum: '$numPagesInMRI' },
            numPagesContainingMRICount: { $sum: '$numPagesContainingMRI' }
        }
    },
    { $sort : { count : -1} }
]);
1515
1516165 UNIQUE SITE DOMAINS (NZ).
1517
1518/* 1 */
1519{
1520 "_id" : "nz",
1521 "count" : 182.0,
1522 "domain" : [
1523 "anglicanprayerbook.nz",
1524 "arataua.nz",
1525 "archerpix.com",
1526 "archive.electionresults.govt.nz",
1527 "archive.stats.govt.nz",
1528 "artizani.co.nz",
1529 "auturoa.nz",
1530 "avonside.net",
1531 "biketorqueyamaha.co.nz",
1532 "community.nzdl.org",
1533 "conference.tpwt.maori.nz",
1534 "crimson.co.nz",
1535 "dev.nzpcn.org.nz",
1536 "firstworldwar.tki.org.nz",
1537 "hana.co.nz",
1538 "hangaraumatihiko.tki.org.nz",
1539 "kaiiwicamp.nz",
1540 "kaupare.co.nz",
1541 "kmpmusic.co.nz",
1542 "kuraaiwi.maori.nz",
1543 "kurakokiri.maori.nz",
1544 "kuraproductions.co.nz",
1545 "kurataiao.tki.org.nz",
1546 "maori.livingheritage.org.nz",
1547 "maori.tki.org.nz",
1548 "myfathersworld.net.nz",
1549 "ngarauhuia.ngatiapakiterato.iwi.nz",
1550 "ngatipahauwera.co.nz",
1551 "ngatiporoukiponeke.org.nz",
1552 "ngatiwhakaue.iwi.nz",
1553 "nzpostcard.co.nz",
1554 "oilcrash.com",
1555 "otorohanga.directorybusiness.co.nz",
1556 "philipbeadle.co.nz",
1557 "pukapuka.nz",
1558 "pukekohe.directorybusiness.co.nz",
1559 "pukoro.co.nz",
1560 "punareo.co.nz",
1561 "rakaumanga.school.nz",
1562 "rexedra.gen.nz",
1563 "rsnz.natlib.govt.nz",
1564 "rurued.school.nz",
1565 "satellites.co.nz",
1566 "southerntribes.co.nz",
1567 "cms.sunsmartschools.co.nz",
1568 "talkingtothecan.com",
1569 "teaohou.natlib.govt.nz",
1570 "tehauora.org.nz",
1571 "temahurehure.maori.nz",
1572 "animations.tewhanake.maori.nz",
1573 "tiritiowaitangi.govt.nz",
1574 "tmoa.tki.org.nz",
1575 "w3vietnam.org.nz",
1576 "waiata.maori.nz",
1577 "waitarahistory.org.nz",
1578 "kete.wcl.govt.nz",
1579 "whatonga.school.nz",
1580 "biketorqueyamaha.co.nz",
1581 "brettgraham.co.nz",
1582 "finlaysonpark.school.nz",
1583 "firstworldwar.tki.org.nz",
1584 "gans.co.nz",
1585 "huri-translations.pf",
1586 "jeremybaker.nz",
1587 "kkmmaungarongo.co.nz",
1588 "kmk.maori.nz",
1589 "kura-porirua.school.nz",
1590 "kurakokiri.maori.nz",
1591 "livingheritage.org.nz",
1592 "matarikifestival.org.nz",
1593 "methodist.org.nz",
1594 "ngamanawainc.co.nz",
1595 "nzpcn.org.nz",
1596 "otepoti.school.nz",
1597 "pakanae.maori.nz",
1598 "rakaumanga.school.nz",
1599 "rotoruanz.com",
1600 "runanga.co.nz",
1601 "ruralfind.co.nz",
1602 "tasteofplenty.co.nz",
1603 "teipukarea.maori.nz",
1604 "temarareo.org",
1605 "tereowrap.nz",
1606 "tetaumuturunanga.iwi.nz",
1607 "tewhanake.maori.nz",
1608 "tkkmmokopuna.school.nz",
1609 "tmoa.tki.org.nz",
1610 "topomap.co.nz",
1611 "tuwharetoa.iwi.nz",
1612 "twtop.school.nz",
1613 "w3vietnam.org.nz",
1614 "waiata.maori.nz",
1615 "wcl.govt.nz",
1616 "writersfestival.co.nz",
1617 "zoomin.co.nz",
1618 "2019.nethui.nz",
1619 "28maoribattalion.org.nz",
1620 "admin.teara.govt.nz",
1621 "curriculumtool.education.govt.nz",
1622 "videos.e-agent.nz",
1623 "e-ako-pangarau.nzmaths.co.nz",
1624 "archive.electionresults.govt.nz",
1625 "givealittle.co.nz",
1626 "haereheikaiako.co.nz",
1627 "hepatakakupu.nz",
1628 "holyspirit.nz",
1629 "interactives.stuff.co.nz",
1630 "kaiiwicamp.nz",
1631 "keepourmoneyclean.govt.nz",
1632 "kotahimiriona.co.nz",
1633 "kupengahao.co.nz",
1634 "liveresults.co.nz",
1635 "m.wairarapatv.co.nz",
1636 "manawatuheritage.pncc.govt.nz",
1637 "maoriinvestments.co.nz",
1638 "oag.govt.nz",
1639 "office.e-agent.nz",
1640 "paekupu.co.nz",
1641 "player.vimeo.com",
1642 "rapuatearatika.education.govt.nz",
1643 "register.tpota.org.nz",
1644 "rehuamarae.co.nz",
1645 "reoora.co.nz",
1646 "sexualviolence.victimsinfo.govt.nz",
1647 "sooty.nz",
1648 "teaomaori.news",
1649 "blog.teara.govt.nz",
1650 "cdn.tehiku.nz",
1651 "tetaurawhiri.govt.nz",
1652 "tewikiotereomaori.nz",
1653 "tiritiowaitangi.govt.nz",
1654 "tmmkkm.school.nz",
1655 "ttw1.cwp.govt.nz",
1656 "ashtangatauranga.co.nz",
1657 "blushandbrows.nz",
1658 "components-mart.nz",
1659 "cruisetourstauranga.co.nz",
1660 "cs.waikato.ac.nz",
1661 "dnc.org.nz",
1662 "e-agent.nz",
1663 "electionresults.govt.nz",
1664 "electionresults.org.nz",
1665 "eventcinemas.co.nz",
1666 "hapuhauora.health.nz",
1667 "heartland.co.nz",
1668 "hrc.co.nz",
1669 "infinite-electronic.nz",
1670 "komako.org.nz",
1671 "korokikahukura.co.nz",
1672 "lcds-display.nz",
1673 "maoriinvestments.co.nz",
1674 "maoritelevision.com",
1675 "matarikifestival.org.nz",
1676 "ngamanawainc.co.nz",
1677 "oag.govt.nz",
1678 "pinterest.ca",
1679 "pinterest.co.uk",
1680 "pinterest.fr",
1681 "pinterest.it",
1682 "pinterest.jp",
1683 "pinterest.nz",
1684 "puau.school.nz",
1685 "puhaandpakeha.co.nz",
1686 "rereahu.maori.nz",
1687 "rotorua-rafting.co.nz",
1688 "rotoruanz.com",
1689 "sporty.co.nz",
1690 "stats.govt.nz",
1691 "taitokerautrust.org.nz",
1692 "takitimu.ac.nz",
1693 "tasteofplenty.co.nz",
1694 "tekura.school.nz",
1695 "tematawai.maori.nz",
1696 "terakipaewhenua.school.nz",
1697 "terito.school.nz",
1698 "tetaurawhiri.govt.nz",
1699 "tewikiotereomaori.co.nz",
1700 "tuiatematangi.ac.nz",
1701 "whanau-tahi.school.nz",
1702 "wingspan.co.nz",
1703 "zenbu.co.nz",
1704 "za.pinterest.com"
1705 ],
1706 "numPagesInMRICount" : 4360,
1707 "numPagesContainingMRICount" : 9687
1708}
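
The count of 182 above is the number of matching Websites documents, while the pushed basicDomain list repeats some entries, presumably where more than one variant of a site was crawled. A minimal sketch of reducing such a list to its unique, alphabetically sorted entries on the client side; `uniqueSorted` is a hypothetical helper and the sample array is a small illustrative subset, not the full result.

```javascript
// Hypothetical helper (not from the original notes): dedupe and sort a
// list of basicDomain strings.
function uniqueSorted(domains) {
  // Set drops repeats; the default sort orders plain lowercase ASCII
  // strings alphabetically.
  return Array.from(new Set(domains)).sort();
}

// Illustrative subset of the listing above, including repeats.
const sampleDomains = [
  "waiata.maori.nz",
  "arataua.nz",
  "tmoa.tki.org.nz",
  "waiata.maori.nz",
  "tmoa.tki.org.nz"
];

const uniqueDomains = uniqueSorted(sampleDomains);
console.log(uniqueDomains.length + " unique of " + sampleDomains.length + " entries");
console.log(uniqueDomains.join("\n"));
```

In the mongo shell the same function could be applied to the `domain` array of the aggregation result to check the unique-domain figure.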
1709
1710
NZ sites with pages detected as being overall in MRI (isMRI) are more likely to genuinely contain at least one sentence in MRI.
Therefore, to make the manual task of going through all NZ sites a bit easier,
will work with 2 query results that together make up the above:
- those NZ sites where numPagesInMRI > 0
- and the remaining NZ sites that only contain MRI (numPagesInMRI = 0 but numPagesContainingMRI > 0)
1716
1717----------------------------
1718
17192. Get NZ sites where numPagesInMRI > 0
1720
1721db.Websites.aggregate([
1722 {
1723 $match: {
1724 $and: [
1725 {numPagesInMRI: {$gt: 0}},
1726 {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]}
1727 ]
1728 }
1729 },
1730 { $unwind: "$geoLocationCountryCode" },
1731 {
1732 $group: {
1733 _id: "nz",
1734 count: { $sum: 1 },
1735 domain: { $addToSet: '$domain' },
1736 numPagesInMRICount: { $sum: '$numPagesInMRI' },
1737 numPagesContainingMRICount: { $sum: '$numPagesContainingMRI' }
1738 }
1739 },
1740 { $sort : { count : -1} }
1741]);
1742
1743
Annotating the matching domain listing as follows:
* First column: n pages that are genuinely in MRI / n sampled isMRI pages
    To list a site's pages that were detected as being in MRI:
    db.getCollection('Webpages').find({URL:/teipukarea\.maori\.nz/, isMRI: true})
* Second column: n pages that genuinely contain MRI / n sampled pages that are not isMRI yet containsMRI
    To list a site's pages that are flagged containsMRI but not isMRI, and then check whether they really have sentences in MRI:
    db.getCollection('Webpages').find({URL:/maori\.livingheritage\.org\.nz/, isMRI: false, containsMRI: true})
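
The two sampling queries above are repeated for every site in the listing, each time with the domain typed into a regex. A sketch of generating both filters from a plain domain string, escaping the dots so they match literally; `domainToUrlRegex` and `samplingFilters` are hypothetical names, and the `URL`, `isMRI` and `containsMRI` fields are those used in the queries above.

```javascript
// Hypothetical helpers (not from the original notes).

// Escape regex metacharacters so that, e.g., "." only matches a literal dot.
function domainToUrlRegex(domain) {
  const escaped = domain.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp(escaped);
}

// Build the two Webpages filters used for manual sampling of one site.
function samplingFilters(domain) {
  const url = domainToUrlRegex(domain);
  return {
    inMRI: { URL: url, isMRI: true },                         // first column
    containsOnly: { URL: url, isMRI: false, containsMRI: true } // second column
  };
}

// In the mongo shell, for example:
//   var f = samplingFilters("maori.livingheritage.org.nz");
//   db.getCollection('Webpages').find(f.inMRI)
//   db.getCollection('Webpages').find(f.containsOnly)
```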
1751
1752
1753/* 1 */
1754{
1755 "_id" : "nz",
1756 "count" : 96.0,
1757 "domain" : [
1758 "http://www.teipukarea.maori.nz", 3/3 1/3
1759 "http://ngatipahauwera.co.nz", 2/2, 2/2
1760 "http://www.oag.govt.nz", 2/2 0/2
1761 "https://sexualviolence.victimsinfo.govt.nz", 3/3 0/3
1762 "http://tmoa.tki.org.nz", 3/3 3/3
1763 "http://www.tewhanake.maori.nz", 3/3 2/3
1764 "http://www.matarikifestival.org.nz", 4/4 0/3
1765 "http://www.otepoti.school.nz", 3/3 0/4
1766!! "https://www.maoritelevision.com", 3/4, 0 [no containsMRI outside isMRI pages]
1767 "http://pukapuka.nz", 3/3 1/4 [lorem ipsum used on first 3 pages]
1768 "http://community.nzdl.org", 3/3 0/3 [containsMRI has detected Te Taka Keegan as MRI sentence]
1769X!! "http://kmpmusic.co.nz", 0-4/4? [but CD listing of some MRI album and song titles] 0 [no other pages containsMRI]
1770 "http://maori.livingheritage.org.nz", 2/2 2/2 {includes: http://www.livingheritage.org.nz}
1771 "http://pukoro.co.nz", 2/2 0/2
1772X "https://register.tpota.org.nz", 0/1 [form] 0/2
1773+ "https://cdn.tehiku.nz" => DOMAIN: "tehiku.nz", 0/4, 1/3 [but audio content may be in MRI] But there are pages containing MRI to be found by non-random sampling of tehiku.nz, e.g. https://tehiku.nz/te-hiku-radio/te-tangihanga-ki-a-erima-henare/ contains MRI sentences
1774!! "http://www.runanga.co.nz", 3/3 0 [no containsMRI outside isMRI pages]
1775! "http://kuraaiwi.maori.nz", 2/4 [navigation only downloaded. But site content checked] 2/3
1776 "http://kurataiao.tki.org.nz", 3/3, 1/total 3
1777
1778!! "http://satellites.co.nz", 3/3 [kpop], 0 [no containsMRI outside isMRI pages]
1779 "http://teaohou.natlib.govt.nz", 4/4, 2/4
1780 "http://www.tuwharetoa.iwi.nz", 2/3 0/3
1781+ "http://auturoa.nz", 0/4 0/3 [lots of MRI terms among English] - COMMUNITY (But there are pages inMRI to be found by non-random sampling, e.g. http://auturoa.nz/KarakiaMoKuaToRangiTeRaa.html)
1782 "https://www.terito.school.nz", 3/3, 0/2 total
1783 "https://ttw1.cwp.govt.nz", 3/3 3/3
1784 "https://www.whanau-tahi.school.nz", 4/4, 1/2 total
1785 "https://e-ako-pangarau.nzmaths.co.nz", 3/3 total, 1/1 total
1786 "https://teaomaori.news", 3/3, 0/1 total
 "http://tetaurawhiri.govt.nz", 3/3 3/3 [Māori Language Commission site]
1788 "https://www.tuiatematangi.ac.nz", 4/4 3/3
1789 "http://animations.tewhanake.maori.nz", 3/3 3/3
1790!! "https://www.dnc.org.nz", 1/1 total, 0 [no containsMRI outside isMRI pages]
1791!! "http://firstworldwar.tki.org.nz", 3/3, 0 [no containsMRI outside isMRI pages]
1792 "http://www.28maoribattalion.org.nz", 3/3, 1/3
1793 "http://www.tewikiotereomaori.co.nz", 1/1 total, 3/3
1794 "http://www.brettgraham.co.nz", 1/1 total, 0/3
1795!! "https://hepatakakupu.nz", 3/3, 0 [no containsMRI outside isMRI pages]
1796
    "http://anglicanprayerbook.nz", 3/3 3/3
    "http://arataua.nz", 4/4, 2/3
    "http://maori.tki.org.nz", 3/3 3/3
DONE (with/out www): "http://www.firstworldwar.tki.org.nz",
X "http://www.topomap.co.nz", 0/2 [all placenames], 0 [no containsMRI outside isMRI pages]
    "https://paekupu.co.nz", 4/4, 0 [no containsMRI outside isMRI pages]
    "https://haereheikaiako.co.nz", 1/1, 0 [no containsMRI outside isMRI pages]
    "https://curriculumtool.education.govt.nz", 4/4, 3/3
    "http://kurakokiri.maori.nz", 3/3, 3/3 [same nav menus on each page] {includes: "http://www.kurakokiri.maori.nz"}
    "http://www.kkmmaungarongo.co.nz", 3/3, 3/3
    "http://www.heartland.co.nz", 3/3, 1/1 total
    "http://oilcrash.com", 2/2 total, 0/3
    "http://www.kura-porirua.school.nz", 4/4, 2/3
    "https://www.sporty.co.nz", 3/3, 0 [no containsMRI outside isMRI pages]
    "https://www.tematawai.maori.nz", 3/3, 3/3

+ "https://www.terakipaewhenua.school.nz",
+ "http://www.tetaurawhiri.govt.nz",
+ "http://archive.stats.govt.nz", (1 page isMRI)
+ "http://tiritiowaitangi.govt.nz",
+!! "http://www.waiata.maori.nz", {includes: "http://waiata.maori.nz"}
+ "http://hana.co.nz", [crawled version of page contains MRI sentences, but current page is labelled for mobiles whereas in browser it just has a picture]
+ "http://kaupare.co.nz",
+ "http://www.tereowrap.nz",
?X "https://www.e-agent.nz", [autotranslated? SEO related site in NZ, Chinese, English and MRI] {includes: "https://office.e-agent.nz"}
    { included: "http://videos.e-agent.nz", [AT: e-agent.nz] 3/3, 3/3 [repeated nav] }
+ "http://www.hrc.co.nz",
+ "http://ngatiporoukiponeke.org.nz",

+ "http://rurued.school.nz",
+ "http://www.twtop.school.nz",
X "https://www.infinite-electronic.nz", [autotranslated product site]
+!! "http://www.huri-translations.pf",
+ "https://admin.teara.govt.nz", e.g. https://admin.teara.govt.nz/mi/biographies/4m56/moko-pita-te-turuki-tamati {included: "http://blog.teara.govt.nz", 3/3, 0/3 [AS: teara.govt.nz, e.g. https://teara.govt.nz/mi/biographies/1t28/te-hapuku/media]}
+!! "https://tiritiowaitangi.govt.nz",
+ "http://www.tmoa.tki.org.nz",
+ "https://www.komako.org.nz", [no longer available, but crawled page had a paragraph in Maori with presumably a translation thereafter]
+ "http://www.wcl.govt.nz", {included: "http://kete.wcl.govt.nz" as wcl.govt.nz; 2/5 [first 3 misdetected: Tokelauan (American Samoa), Kiribati, Tongan], 0/3}
+!! "http://punareo.co.nz", [waiata]

+ "https://rapuatearatika.education.govt.nz",
+ "http://tmmkkm.school.nz",
X "https://www.components-mart.nz", [autotranslated product site]
+ "http://www.cs.waikato.ac.nz", [Te Taka's pages!]
+!!! "http://www.kupengahao.co.nz", [MRI language books and resources]
+ "https://www.hapuhauora.health.nz", [Smokefree site. The one isMRI page has a proper sentence.]
X "https://www.lcds-display.nz", [autotranslated product site]
+ "http://cms.sunsmartschools.co.nz", [as sunsmartschools.co.nz, e.g. http://sunsmartschools.co.nz/kura/how-to-become.html]
+ "http://kuraproductions.co.nz",
+ "https://keepourmoneyclean.govt.nz", [1 page]

+!! "http://www.tekura.school.nz",
+ "http://www.tkkmmokopuna.school.nz", [e.g. http://www.tkkmmokopuna.school.nz/newsletter_sets/newsletters/13-wahanga-4-wiki-5-he-puna-korero]
+ "http://hangaraumatihiko.tki.org.nz", [govt website. e.g. http://hangaraumatihiko.tki.org.nz/te-tupuranga-tangata-me-te-rorohiko/whakatupuranga-3-te-whakatau-tikanga-kawe-i-nga-mahi-whakaoti-hopanga/]
+ "http://www.pakanae.maori.nz"
    ],
    "numPagesInMRICount" : 4360,
    "numPagesContainingMRICount" : 7968
}


96 sites detected as having isMRI pages - 7 sites/subdomains already included in the existing 96 = 89 unique sites.

89 - 2.5* product sites - 2 non-MRI sites with song listings or web forms etc.
  (*0.5 of the 2.5 is for the e-agent.nz site)
= 84.5 sites total that at least contain MRI; most have pages inMRI.

We are excluding the one marked with ?X (e-agent.nz) as it appears autotranslated, which drops its remaining 0.5 and leaves 84.
In this set then, there are 84 sites that at least contain MRI out of 89 unique sites detected as containing pages inMRI.

If not counting unique sites but counting the MongoDB query result's subdomains separately: 84 + 4 sites (non-unique or split over subdomains) in the result set contained MRI = 88 sites.
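
The tally for this set can be rechecked with a few lines of plain JavaScript (runnable under Node). This is only a sketch of the arithmetic: the variable names are made up for illustration, and the figures are the ones stated in these notes.

```javascript
// Figures copied from the notes; variable names are illustrative only.
const detected = 96;        // sites detected as having isMRI pages
const duplicates = 7;       // sites/subdomains already counted under another entry
const uniqueSites = detected - duplicates;

const productSites = 2.5;   // includes 0.5 for the e-agent.nz site
const nonMriSites = 2;      // song listings, web forms etc.
const partial = uniqueSites - productSites - nonMriSites;

// Excluding e-agent.nz (?X) entirely drops the remaining half site.
const uniqueWithMri = Math.floor(partial);
// Counting the query result's subdomains separately adds 4 entries.
const countingSubdomains = uniqueWithMri + 4;

console.log(uniqueSites, partial, uniqueWithMri, countingSubdomains);   // 89 84.5 84 88
```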

----------------------------

3. Handling the remainder: NZ sites where numPagesInMRI = 0 BUT numPagesContainingMRI > 0

The remainder = 80 NZ sites detected as having no pages inMRI, but with a positive number of pages detected as containsMRI:

db.Websites.aggregate([
    {
        $match: {
            $and: [
                {numPagesContainingMRI: {$gt: 0}},
                {numPagesInMRI: {$eq: 0}},
                // NZ by geolocation, or ".nz" anywhere in the domain (the regex is unanchored)
                {$or: [{geoLocationCountryCode: "NZ"}, {domain: /\.nz/}]}
            ]
        }
    },
    { $unwind: "$geoLocationCountryCode" },
    {
        // constant _id, so every matched site lands in a single "nz" group
        $group: {
            _id: "nz",
            count: { $sum: 1 },
            domain: { $addToSet: '$domain' },
            numPagesInMRICount: { $sum: '$numPagesInMRI' },
            numPagesContainingMRICount: { $sum: '$numPagesContainingMRI' }
        }
    },
    { $sort : { count : -1} }
]);
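
To make clear what the pipeline computes, here is a plain-JavaScript sketch of the same $match and $group logic over an in-memory array (runnable under Node, no database needed). The sample documents and their domains are invented for illustration, and the $unwind stage is left out on the assumption that geoLocationCountryCode holds a single string per site.

```javascript
// Hypothetical sample documents; only the fields the pipeline reads are shown.
const websites = [
  { domain: "http://a.example.nz",  geoLocationCountryCode: "US", numPagesInMRI: 0, numPagesContainingMRI: 3 },
  { domain: "http://b.example.com", geoLocationCountryCode: "NZ", numPagesInMRI: 0, numPagesContainingMRI: 1 },
  { domain: "http://c.example.nz",  geoLocationCountryCode: "NZ", numPagesInMRI: 2, numPagesContainingMRI: 5 },
  { domain: "http://d.example.com", geoLocationCountryCode: "US", numPagesInMRI: 0, numPagesContainingMRI: 4 }
];

// $match: containsMRI pages but no isMRI pages, and NZ by geolocation or ".nz" in the domain.
const matched = websites.filter(w =>
  w.numPagesContainingMRI > 0 &&
  w.numPagesInMRI === 0 &&
  (w.geoLocationCountryCode === "NZ" || /\.nz/.test(w.domain))
);

// $group with a constant _id: one bucket holding the count, the domain set and the page sums.
const result = {
  _id: "nz",
  count: matched.length,
  domain: [...new Set(matched.map(w => w.domain))],
  numPagesInMRICount: matched.reduce((sum, w) => sum + w.numPagesInMRI, 0),
  numPagesContainingMRICount: matched.reduce((sum, w) => sum + w.numPagesContainingMRI, 0)
};

console.log(result);
```

With this sample data only the first two sites survive the filter: the third has isMRI pages and the fourth is neither NZ-located nor a .nz domain.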


Find pages for testing with:
    db.getCollection('Webpages').find({URL:/ashtangatauranga\.co\.nz/, containsMRI: true, mriSentenceCount: {$gt: 0}})
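
When checking many sites, the URL regex in that query has to be rewritten for each domain. A small helper could build it from a domain in the listing; domainToUrlRegex is a hypothetical name, not something in the project, and the sketch assumes that dropping the protocol and a leading "www." is the desired match.

```javascript
// Hypothetical helper: turn a domain from the Websites listing into a regex
// usable as the URL filter of the Webpages query.
function domainToUrlRegex(domain) {
  const host = domain
    .replace(/^https?:\/\//, "")   // drop the protocol
    .replace(/^www\./, "");        // drop a leading "www."
  // Escape regex metacharacters so that "." only matches a literal dot.
  const escaped = host.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp(escaped);
}

const urlRegex = domainToUrlRegex("https://www.ashtangatauranga.co.nz");
console.log(urlRegex.source);   // ashtangatauranga\.co\.nz

// In the mongo shell the result could then be passed to the query, e.g.:
// db.getCollection('Webpages').find({URL: urlRegex, containsMRI: true, mriSentenceCount: {$gt: 0}})
```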


/* 1 */
{
    "_id" : "nz",
    "count" : 80.0,
    "domain" : [
X "http://www.zoomin.co.nz", [map site, so placenames]
X "http://www.biketorqueyamaha.co.nz", [placenames] {includes "http://biketorqueyamaha.co.nz"}
X "http://archerpix.com", [photo captions containing placenames]
X "http://philipbeadle.co.nz", [art captions containing placenames]
X "https://2019.nethui.nz", [Just MRI words in ENG sentences]
X "http://crimson.co.nz", [address]
+ "http://holyspirit.nz", (e.g. https://holyspirit.nz/wp-content/uploads/02-25-2018-Newsletter.pdf)
X "https://www.wingspan.co.nz", [1 page, 1 conjoined word, looks like placename]
X "http://nzpostcard.co.nz", [postcards with placenames]
+ "https://www.ngamanawainc.co.nz", (e.g. https://www.ngamanawainc.co.nz/history/id/60) {includes "http://www.ngamanawainc.co.nz"}

+ "http://www.finlaysonpark.school.nz", [e.g. http://www.finlaysonpark.school.nz/58/pages/61-team-overview; Some Tongan language pages and Samoan pages]
X "http://artizani.co.nz", [address]
+ "http://www.w3vietnam.org.nz", [e.g. http://www.w3vietnam.org.nz/w3%20mihi%20ki%20ka%20hoia.htm] (includes "http://w3vietnam.org.nz")
X "https://sooty.nz", [names, war death notices, place names]
X? "http://rakaumanga.school.nz", [School has a longer Maori title name, but that's it] {includes "http://www.rakaumanga.school.nz"}
X "http://www.rotoruanz.com", [e.g. https://www.rotoruanz.com/RNZ/media/Media-Library/SOI-Board-version-08-June-2018-final-published-on-website.pdf]
X "https://www.cruisetourstauranga.co.nz", [English with Tauranga being MRI placename]
X "http://www.jeremybaker.nz", [one word, HOkio]

X "https://liveresults.co.nz", [canoe sports team names]
X "http://rexedra.gen.nz", [ENG sentence with MRI words]
+ "https://www.takitimu.ac.nz", [school of performing arts, example MRI sentence at https://www.takitimu.ac.nz/about-us]
X "http://www.electionresults.govt.nz", [placenames and misdetection of Return to Home Page] {includes https://www.electionresults.org.nz which seems to be an alias for the same} {includes "http://archive.electionresults.govt.nz"}
+ "https://kotahimiriona.co.nz", (e.g. https://kotahimiriona.co.nz/about-the-app/)
+ "https://rehuamarae.co.nz", (e.g. https://rehuamarae.co.nz/)
+ "http://reoora.co.nz", (e.g. https://reoora.co.nz/about/)

X "http://otorohanga.directorybusiness.co.nz", [placenames]
X "http://waitarahistory.org.nz", [placenames and misdetection of sentences of the form "31, No 262." as MRI]
+ "https://manawatuheritage.pncc.govt.nz", (e.g. https://manawatuheritage.pncc.govt.nz/about)
+ "http://rsnz.natlib.govt.nz", [e.g. http://rsnz.natlib.govt.nz/volume/rsnz_26/rsnz_26_00_004560.html] NOTE: site appears to use Greenstone
X "https://www.rotorua-rafting.co.nz", [placenames]
+ "https://www.taitokerautrust.org.nz", (e.g. https://www.taitokerautrust.org.nz/)
+ "http://tewikiotereomaori.nz", (e.g. https://tewikiotereomaori.nz/about/)
+ "https://www.korokikahukura.co.nz", (e.g. https://www.korokikahukura.co.nz/ko-wai-m257tou.html about the connection to the Waikato River)

X "https://www.puhaandpakeha.co.nz", [ENG sentences with MRI words]
X "http://myfathersworld.net.nz", [placenames]
X "https://www.ashtangatauranga.co.nz", [misdetection]
+ "https://www.pinterest.nz", (e.g. sentence "E tu Whare Rangi, e tu Whare Pataka, kia ora koutou nga Whare Tipuna o te Ao Maori." at https://www.pinterest.nz/amp/marylizw03/maori/)
+ "https://www.rereahu.maori.nz", (e.g. https://www.rereahu.maori.nz/uploads/72717/files/168860/Te_Maru_o_Rereahu_Newsletter_No_5_February_2011.pdf)
+ "http://givealittle.co.nz", (e.g. https://givealittle.co.nz/cause/rebuild-tapu-te-ranga-marae which contains greetings phrases and the sentence "Nā te ringa tangata i hanga te whare Nā te tuarā o te whare i whakatipu i te tangata")
X "http://www.gans.co.nz", [placenames]
+ "https://kaiiwicamp.nz", [placenames] {includes "http://kaiiwicamp.nz"}
+ "http://ngarauhuia.ngatiapakiterato.iwi.nz", (e.g. http://ngarauhuia.ngatiapakiterato.iwi.nz/pdf/Nga-Rau-Huia_Raumati_2015_Issue1.pdf)
+ "https://m.wairarapatv.co.nz", (e.g. https://m.wairarapatv.co.nz/archive/i/JGmJqCZCNK8/te-ara-whanui-kura-kaupapa-maori-o-nga-khanga-reo-o-te-awa-kairangi)

X "http://www.methodist.org.nz", [ENG sentence with MRI words]
+ "http://avonside.net", (e.g. http://avonside.net/Nicky/indexmain.htm)
X "http://www.ruralfind.co.nz", [placenames]
+ "http://www.maoriinvestments.co.nz", (e.g. Vision, Mission & Values tab of https://maoriinvestments.co.nz/about/organisation)
+ "http://conference.tpwt.maori.nz", (e.g. https://www.tpwt.maori.nz/te-mahe-matauranga/)
+ "https://www.puau.school.nz", (e.g. https://www.puau.school.nz/k%C4%81inga-home)
+? "http://ngatiwhakaue.iwi.nz", (e.g. greetings at http://ngatiwhakaue.iwi.nz/)
X "http://www.nzpcn.org.nz", [lots of misdetections and one sentence in English with MRI words, e.g. "Kaori is a Taonga to Maori".] {includes "http://dev.nzpcn.org.nz"}
+? "https://interactives.stuff.co.nz", [2 pages, mostly empty except for the title "TE WIKI O TE REO MĀORI Māori"]
+ "http://tehauora.org.nz", (e.g. http://tehauora.org.nz/ and proverb at bottom of http://tehauora.org.nz/about-us)

+ "http://temahurehure.maori.nz", (e.g. greeting http://temahurehure.maori.nz/site/file/temp/hirebook.pdf)
X "http://pukekohe.directorybusiness.co.nz", [placenames]
+!! "http://www.temarareo.org", (dictionary with sample sentences. e.g. http://www.temarareo.org/PAPAKUPU/dictionary-searchresults/A-2.htm)
X "https://www.tasteofplenty.co.nz", [Tauranga placename and misdetection of "No one went away hungry"] {includes "http://www.tasteofplenty.co.nz"}

+ "http://www.tetaumuturunanga.iwi.nz", (e.g. http://www.tetaumuturunanga.iwi.nz/wp-content/uploads/2016/04/Malvern-Cultural-Narrative-Draft.pdf)


X "https://www.blushandbrows.nz", [misdetection of "Makeup..."]
X "http://talkingtothecan.com", [misdetection of things like "22, no." and mistaking ENG sentences with MRI words for MRI]

+? "http://whatonga.school.nz", [school title]
+? "https://player.vimeo.com", [Video titles contain MRI sentences and MRI titles interspersed in ENG sentence. Video's audio content may be in MRI]
+ "http://www.writersfestival.co.nz", (e.g. http://www.writersfestival.co.nz/news/Page1/urgently-relevant-novel-wins-countrys-richest-literary-award/)
+? "http://southerntribes.co.nz", [Brazilian jiu jitsu site with a greeting in MRI on main page]
+ "http://www.kmk.maori.nz", (e.g. http://www.kmk.maori.nz/kmk-events)
+ "https://www.stats.govt.nz", (e.g. http://archive.stats.govt.nz/Census/2001-census-data/2001-census-pacific-profiles/cook-island-maori-people-in-new-zealand.aspx)
X "http://www.eventcinemas.co.nz", [placenames and misdetection of ENG "Atura Hotels"]
X "https://www.zenbu.co.nz" [misdetection and NZ school addresses]
    ],
    "numPagesInMRICount" : 0,
    "numPagesContainingMRICount" : 1673
}

80 sites detected as having 0 pages inMRI but >0 pages that containMRI.

[Of these, 9 are part of the same site/subdomain as another entry => 71 unique sites.
Of the remaining ones, only 35 have at least one sentence in Maori and are marked with +. (Those marked with +? just have Maori titles or greetings or nothing more than a sentence.)
So in this set, there are a further 35 sites that contain MRI out of 71 unique sites detected as having pages containingMRI but not pages inMRI.
Total sites: 35/71
Total for NZ: (84+35)/(89+71) = 119/160 unique NZ sites have at least one webpage containing at least one sentence inMRI.
]

TOTAL:
If counting subdomains and duplicated sites distinctly, then 35 + an additional 3 sites, making it 38/80 sites in this set.

This makes (88+38)/(96+80) = 126/176 NZ sites (counting distinct subdomains and duplicated sites) that contain at least one web page with at least 1 sentence in MRI.
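
The two NZ totals can be recomputed as follows. A sketch only: the object and field names are invented for illustration, the counts are the ones given for the two NZ sets in these notes, and the percentages are derived here rather than taken from the notes.

```javascript
// Counts copied from the two NZ sets; object and field names are illustrative.
const withIsMriPages  = { actualUnique: 84, detectedUnique: 89, actualAll: 88, detectedAll: 96 };
const containsMriOnly = { actualUnique: 35, detectedUnique: 71, actualAll: 38, detectedAll: 80 };

// Counting each unique site once.
const unique = {
  actual: withIsMriPages.actualUnique + containsMriOnly.actualUnique,
  detected: withIsMriPages.detectedUnique + containsMriOnly.detectedUnique
};
// Counting subdomains and duplicated sites distinctly.
const all = {
  actual: withIsMriPages.actualAll + containsMriOnly.actualAll,
  detected: withIsMriPages.detectedAll + containsMriOnly.detectedAll
};

const pct = t => (100 * t.actual / t.detected).toFixed(1) + "%";
console.log(`unique sites: ${unique.actual}/${unique.detected} = ${pct(unique)}`);   // 119/160
console.log(`all entries:  ${all.actual}/${all.detected} = ${pct(all)}`);            // 126/176
```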




3. GRAND TOTALS

Count per country of web SITES that contain at least 1 web page containing at least 1 genuine MRI sentence:

countryCode: num sites confirmed by manual inspection as having pages containing MRI, out of num sites openNLP detected as having pages containing MRI
NZ, New Zealand: 126 actual out of 176 detected sites
US, United States: 29 actual out of 486 detected sites
AU, Australia: 2 actual out of 21 detected sites
DE, Germany: 2 actual out of 27 detected sites
DK, Denmark: 2 actual out of 8 detected sites
BG, Bulgaria: 1 actual out of 1 detected site
CZ, Czech Republic: 1 actual out of 4 detected sites
ES, Spain: 1 actual out of 7 detected sites
FR, France: 1 actual out of 36 detected sites
IE, Ireland: 1 actual out of 2 detected sites

TOTAL: 166 of the crawled sites had, among their crawled pages, at least one genuine sentence in Māori according to manual inspection,
out of a total of 221+471+176 = 868 sites that were detected with numPagesContainingMRI > 0 (i.e. sites containing at least one page with at least one sentence detected as MRI).
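
The grand total can be cross-checked against the per-country lines with a short script. Again a sketch: the array layout and variable names are assumptions for illustration, the figures are copied from the list above, and the per-country percentages are derived here, not stated in the notes.

```javascript
// Per-country figures copied from the list above: [countryCode, actual, detected].
const perCountry = [
  ["NZ", 126, 176], ["US", 29, 486], ["AU", 2, 21], ["DE", 2, 27], ["DK", 2, 8],
  ["BG", 1, 1], ["CZ", 1, 4], ["ES", 1, 7], ["FR", 1, 36], ["IE", 1, 2]
];

// Sum of manually confirmed sites over all listed countries.
const totalActual = perCountry.reduce((sum, [, actual]) => sum + actual, 0);
// The three result sets added up in the notes.
const totalDetected = 221 + 471 + 176;

console.log(totalActual);     // 166
console.log(totalDetected);   // 868

// Share of detected sites per country that manual inspection confirmed.
for (const [code, actual, detected] of perCountry) {
  console.log(`${code}: ${(100 * actual / detected).toFixed(1)}% of detected sites confirmed`);
}
```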

========================================