source: other-projects/maori-lang-detection/mongodb-data/ManualShortlisting.txt@ 34011

Last change on this file since 34011 was 33914, checked in by ak19, 4 years ago

Shortlisted just the domain sites by country into ManualShortlist2.txt after taking the reingest into MongoDB into account. And then put all these shortlisted domains for which containsMRI=true as per manual inspection into a separate new file.

Want to MANUALLY go over all sites that are detected as containing one or more pages with at least one MRI sentence,
and shortlist those sites that genuinely contain at least one MRI sentence.


Total number of sites detected as containing MRI:
 db.getCollection('Websites').find({numPagesContainingMRI: {$gt: 0}}).count()
= 868


To make the manual task easier,
splitting the results of all sites with numPagesContainingMRI > 0 into NZ sites and overseas sites,
since NZ sites are more likely to contain MRI content.

-----------------------------------------------------------------
A. OVERSEAS SITES: sites neither NZ in origin nor .nz TLD sites
-----------------------------------------------------------------
Further splitting the overseas sites into those with an mi in the URL path (mi.* or */mi) and those without,
since overseas sites with mi in the URL path are more likely to be automatically translated product sites.

1. db.getCollection('Websites').find(
{$and: [
 {numPagesContainingMRI: {$gt: 0}},
 {geoLocationCountryCode: {$ne: "NZ"}},
 {domain: {$not: /.nz$/}},
 {urlContainsLangCodeInPath: {$ne: true}}
]}).count()

= 220 websites
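
The queries in these notes use slightly different patterns to exclude .nz domains: /.nz$/ (unescaped dot, anchored at the end) and /\.nz/ (escaped dot, not anchored). A minimal sketch of how they can differ, runnable in plain JavaScript without a database; the sample domain values are made up for illustration, and the counts recorded in these notes came from the queries exactly as written.

```javascript
// The ".nz" patterns that appear in these notes, plus a stricter variant.
const patterns = {
  unescapedAnchored: /.nz$/,  // as in query 1: "." matches any character
  escapedAnchored: /\.nz$/,   // stricter variant: literal dot, only at the end
  escapedUnanchored: /\.nz/,  // as in the later queries: ".nz" anywhere
};

// Returns the names of the patterns that match a given domain string.
function matchingPatterns(domain) {
  return Object.keys(patterns).filter((name) => patterns[name].test(domain));
}

// Hypothetical domain values, made up for illustration only.
const samples = [
  "http://www.example.co.nz",   // a genuine .nz TLD
  "http://examplenz",           // ends in "nz" but has no dot before it
  "http://www.example.nz.com",  // ".nz" in the middle, .com TLD
];

for (const domain of samples) {
  console.log(domain, "->", matchingPatterns(domain).join(", ") || "(none)");
}
```

If the stored domain values always end in the TLD, the variants should agree on genuine .nz sites; the differences only show up on edge cases like the last two samples.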

[Treating Australia as a special case, since one of the 4 Australian sites with numPagesContainingMRI > 0
had an mi in the URL path but was not automatically translated.

# counts by country code, excluding NZ-related sites

db.getCollection('Websites').find({$and: [
 {geoLocationCountryCode: {$ne: "NZ"}},
 {domain: {$not: /\.nz/}},
 {numPagesContainingMRI: {$gt: 0}},
 {$or: [{geoLocationCountryCode: "AU"}, {urlContainsLangCodeInPath: false}]}
 ]}).count()

= 221 websites
]
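
Query 1 filters on urlContainsLangCodeInPath: {$ne: true}, while the Australia variant uses urlContainsLangCodeInPath: false. In MongoDB, $ne: true also matches documents where the field is missing, whereas an equality match on false does not, so the two only agree if every document has the field set. Whether that holds for the Websites collection is not stated in these notes. A small sketch of the difference, using plain JavaScript stand-ins for the two filters and hypothetical documents:

```javascript
// Plain-JavaScript stand-ins for the two MongoDB filters (not MongoDB itself).
const neTrue = (doc) => doc.urlContainsLangCodeInPath !== true;    // {$ne: true}
const eqFalse = (doc) => doc.urlContainsLangCodeInPath === false;  // {urlContainsLangCodeInPath: false}

// Hypothetical documents, for illustration only.
const docs = [
  { domain: "http://a.example", urlContainsLangCodeInPath: true },
  { domain: "http://b.example", urlContainsLangCodeInPath: false },
  { domain: "http://c.example" }, // field missing
];

// The document with the missing field passes the first filter only.
console.log(docs.filter(neTrue).map((d) => d.domain));
console.log(docs.filter(eqFalse).map((d) => d.domain));
```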


Getting a domain listing of the sites that matched, per country:
db.Websites.aggregate([
    {
        $match: {
            $and: [
                {geoLocationCountryCode: {$ne: "NZ"}},
                {domain: {$not: /\.nz/}},
                {numPagesContainingMRI: {$gt: 0}},
                {$or: [{geoLocationCountryCode: "AU"}, {urlContainsLangCodeInPath: false}]}
            ]
        }
    },
    { $unwind: "$geoLocationCountryCode" },
    {
        $group: {
            _id: {$toLower: '$geoLocationCountryCode'},
            count: { $sum: 1 },
            domain: { $addToSet: '$domain' }
        }
    },
    { $sort : { count : -1} }
]);
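
When working with an exported list of site records outside the mongo shell, the grouping done by the pipeline above can be approximated in plain JavaScript. This is only a sketch: the function name groupByCountry and the sample array are assumptions, the records are assumed to carry upper-case country codes, and the three sample records reuse domains that appear in the results listed in this file.

```javascript
// Groups site records by lower-cased country code, counting the records
// and collecting the distinct domains, then sorts by count descending.
function groupByCountry(sites) {
  const groups = new Map();
  for (const site of sites) {
    const code = site.geoLocationCountryCode.toLowerCase();
    if (!groups.has(code)) {
      groups.set(code, { _id: code, count: 0, domain: new Set() });
    }
    const group = groups.get(code);
    group.count += 1;
    group.domain.add(site.domain);
  }
  return [...groups.values()]
    .map((g) => ({ _id: g._id, count: g.count, domain: [...g.domain] }))
    .sort((a, b) => b.count - a.count);
}

// Sample records (field names as used in the queries above).
const sampleSites = [
  { domain: "http://anitra.net", geoLocationCountryCode: "BG" },
  { domain: "http://www.parohiauceadesus.ro", geoLocationCountryCode: "RO" },
  { domain: "http://parohiauceadesus.ro", geoLocationCountryCode: "RO" },
];

console.log(JSON.stringify(groupByCountry(sampleSites), null, 2));
```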
68
69
70 /* 1 */
71 {
72 "_id" : "us",
73 "count" : 117.0,
74 "domain" : [
75 "http://mikebonnice.com",
76 "https://nl.pinterest.com",
77 "http://svenskadress.net",
78 "http://word-dialect.blogspot.com",
79 "http://fhr.kiwicelts.com",
80 "http://www.huapala.org",
81 "http://www.whoisthatr.com",
82 "http://www.precious-testimonies.com",
83 "https://www.oemsec.com",
84 "http://www.godrules.net",
85 "https://www.pinterest.it",
86 "http://www.wikitree.com",
87 "http://ritusehji.blogspot.com",
88 "http://www.frogsonline.com",
89 "https://biblehub.com",
90 "https://www.pinterest.co.uk",
91 "http://pumanawawhangara.blogspot.com",
92 "http://hannas-reiseblog.blogspot.com",
93 "http://frontrowphotos.com",
94 "https://www.pinterest.ca",
95 "http://www.muhammad.com",
96 "https://www.pinterest.jp",
97 "http://www.gotquestions.org",
98 "https://www.dbnames.net",
99 "http://www.hudl.com",
100 "https://ebible.org",
101 "http://tuhua2010.blogspot.com",
102 "http://ww25.milfsplease.com",
103 "http://www.thesalmons.org",
104 "https://wol.jw.org",
105 "http://georgegi.tripod.com",
106 "http://linkvip.top",
107 "https://docs.google.com",
108 "http://rangiwewehi.com",
109 "http://anglicanhistory.org",
110 "http://niken8media.logdown.com",
111 "http://mrshamiltonskoolkidz.blogspot.com",
112 "https://www.vaihaunui.net",
113 "http://dannykahei.tripod.com",
114 "http://www.lunar-occultations.com",
115 "http://seapixonline.com",
116 "http://tkrow.tripod.com",
117 "https://drive.google.com",
118 "http://takethatvacation.com",
119 "https://in.pinterest.com",
120 "https://www.nccri.ie",
121 "https://www.webwiki.com",
122 "http://www.unicode.org",
123 "http://shangrilapress.net",
124 "http://ngarangatahi.tripod.com",
125 "https://static-promote.weebly.com",
126 "https://www.podrozeady.com",
127 "https://www.blue-frontiers.com",
128 "https://www.indexmundi.com",
129 "http://www.namesdir.com",
130 "https://www.bible.com",
131 "http://www.krassotkin.ru",
132 "http://malecek.com",
133 "http://korora.econ.yale.edu",
134 "https://www.poehalisnami.ua",
135 "http://loquevendra318.com",
136 "https://www.terakau.org",
137 "https://za.pinterest.com",
138 "http://www.mkiwi.com",
139 "http://maaori.com",
140 "http://atopeconlostopes.blogspot.com",
141 "http://worldradiomap.com",
142 "http://eartheum.com",
143 "http://www.forensicfashion.com",
144 "http://www.code-postal.com",
145 "http://www.pressreader.com",
146 "https://www.seapixonline.com",
147 "http://lianzaconference2012.blogspot.com",
148 "http://blogdepasopor.blogspot.com",
149 "https://www.code-postal.com",
150 "http://www.steve-wheeler.co.uk",
151 "https://www.knowatom.com",
152 "http://bahaiprayers.net",
153 "http://www.eyecontactsite.com",
154 "http://www.hiroa.pf",
155 "http://mahoraroom8.blogspot.com",
156 "http://www.roadsmile.com",
157 "https://chromium.googlesource.com",
158 "http://aclhokiangarocks.blogspot.com",
159 "http://wowwars.net",
160 "https://www.hidroponia.org.mx",
161 "http://tkkpipipaopao.blogspot.com",
162 "http://tatai09.blogspot.com",
163 "http://kiaorahola.blogspot.com",
164 "http://manateina.blogspot.com",
165 "http://www.the-naked.com",
166 "http://shuttersportnelson.photoshelter.com",
167 "http://precious-testimonies.com",
168 "https://www.breaker.audio",
169 "https://www.natekore2018.com",
170 "http://naturalfatburner.net",
171 "https://www.pinterest.fr",
172 "https://www.pipirikiapapatuanuku.org",
173 "http://capsuraotearoa.blogspot.com",
174 "http://m.biblepub.com",
175 "https://phet.colorado.edu",
176 "https://livestream.com",
177 "http://www.geni.com",
178 "https://kjohnsonnz.blogspot.com",
179 "https://maorinews.com",
180 "http://www.twttoa.com",
181 "http://www.whoisentry.com",
182 "http://burkekm001.tripod.com",
183 "http://wikiedit.org",
184 "http://piripi.blogspot.com",
185 "https://www.kaifineart.com",
186 "https://png.bible",
187 "http://rhymebrain.com",
188 "http://www.v3whois.com",
189 "http://www.waimate.com",
190 "https://www.myadsclassified.com"
191 ]
192 }
193
194 /* 2 */
195 {
196 "_id" : "de",
197 "count" : 19.0,
198 "domain" : [
199 "http://www.udhr.de",
200 "http://m.distanta.1km.net",
201 "http://arts.mythologica.fr",
202 "http://vulkane.ch",
203 "http://www.behlig.de",
204 "http://www.nierstrasz.org",
205 "https://www.tvteile.de",
206 "http://etymologie.info",
207 "https://www.cartogiraffe.com",
208 "https://www.you-fly.com",
209 "http://klaaskoehne.de",
210 "http://weltderberge.de",
211 "http://www.cartogiraffe.com",
212 "http://svenkirsten.com",
213 "https://laskar02cinta.page.tl",
214 "http://etoile-de-lune.net",
215 "https://ersatzteile-fachversand.de",
216 "http://insecta.pro",
217 "http://www.stephe.de"
218 ]
219 }
220
221 /* 3 */
222 {
223 "_id" : "fr",
224 "count" : 16.0,
225 "domain" : [
226 "http://baladeornithologique.com",
227 "http://chantsdeluttes.free.fr",
228 "http://kihikihi.fr",
229 "http://www.blueheavenisland.com",
230 "http://splaf.free.fr",
231 "https://www.lexilogos.com",
232 "https://www.manualscat.com",
233 "http://rapanui.fr",
234 "http://www.gototahiti.net",
235 "http://pt.city-usa.net",
236 "http://blueheavenisland.com",
237 "http://www.gif.ovh",
238 "http://www.gaudry.be",
239 "http://www.maraamusurfskirace.com",
240 "http://mahajana.net",
241 "http://www.rongo-rongo.com"
242 ]
243 }
244
245 /* 4 */
246 {
247 "_id" : "nl",
248 "count" : 16.0,
249 "domain" : [
250 "http://tetsubo.org",
251 "https://arrowhead.eu",
252 "https://www.arrowhead.eu",
253 "https://www.henrifloor.nl",
254 "http://hidsonphoto.com",
255 "https://arrowheadproject.azurewebsites.net",
256 "http://nielsonboutique.co.uk",
257 "http://www.gouvernante.info",
258 "http://tonhut.nl",
259 "http://diverosa.com",
260 "http://longhornlaw.net",
261 "http://www.nonlinear.demon.nl",
262 "http://wearehomework.com",
263 "http://gouvernante.info",
264 "http://skimap.info",
265 "http://www.encyclo.co.uk"
266 ]
267 }
268
269 /* 5 */
270 {
271 "_id" : "dk",
272 "count" : 8.0,
273 "domain" : [
274 "http://powhiri.ngapuhitelevision.com",
275 "http://akona.ngapuhitelevision.com",
276 "http://ngapuhitelevision.com",
277 "http://ngapuhiradio.com",
278 "http://komisch.ngapuhitelevision.com",
279 "http://www.rennertweb.de",
280 "http://jazz.ngapuhitelevision.com",
281 "http://waiatarangatiratanga.ngapuhitelevision.com"
282 ]
283 }
284
285 /* 6 */
286 {
287 "_id" : "ca",
288 "count" : 7.0,
289 "domain" : [
290 "http://00.gs",
291 "http://daandehn.com",
292 "http://aguadilla.airport-authority.com",
293 "http://bckayak.com",
294 "http://bcmarina.com",
295 "https://articles.imperialtometric.com",
296 "http://www.myrasplace.net"
297 ]
298 }
299
300 /* 7 */
301 {
302 "_id" : "au",
303 "count" : 5.0,
304 "domain" : [
305 "https://infogram.com",
306 "https://www.kiwiproperty.com",
307 "http://theunderwaterworld.com",
308 "http://fionajack.net",
309 "https://koreromaori.com"
310 ]
311 }
312
313 /* 8 */
314 {
315 "_id" : "cz",
316 "count" : 4.0,
317 "domain" : [
318 "http://www.henryklahola.nazory.cz",
319 "http://about.ilikeyou.com",
320 "http://henryklahola.nazory.cz",
321 "https://www.fipojobs.com"
322 ]
323 }
324
325 /* 9 */
326 {
327 "_id" : "gb",
328 "count" : 4.0,
329 "domain" : [
330 "http://www.wordsearchfun.com",
331 "http://www.woolrych.org",
332 "https://omniatlas.com",
333 "http://mikestephens.co.uk"
334 ]
335 }
336
337 /* 10 */
338 {
339 "_id" : "es",
340 "count" : 4.0,
341 "domain" : [
342 "http://www.info-hoteles.com",
343 "https://www.uv.es",
344 "https://www.reclamaciondevuelos.com",
345 "http://www.cruceros-princess.mx"
346 ]
347 }
348
349 /* 11 */
350 {
351 "_id" : "at",
352 "count" : 3.0,
353 "domain" : [
354 "http://www.petit-prince.at",
355 "http://www.tmtmm.net",
356 "http://petit-prince.at"
357 ]
358 }
359
360 /* 12 */
361 {
362 "_id" : "it",
363 "count" : 3.0,
364 "domain" : [
365 "http://oipaz.net",
366 "http://www.pegasoesmicamion.com",
367 "http://www.marcosanti.it"
368 ]
369 }
370
371 /* 13 */
372 {
373 "_id" : "il",
374 "count" : 2.0,
375 "domain" : [
376 "http://www.daat.ac.il",
377 "https://www.hitiaotera.com"
378 ]
379 }
380
381 /* 14 */
382 {
383 "_id" : "ch",
384 "count" : 2.0,
385 "domain" : [
386 "https://nicoledidi.ch",
387 "https://photos.axelebert.org"
388 ]
389 }
390
391 /* 15 */
392 {
393 "_id" : "ro",
394 "count" : 2.0,
395 "domain" : [
396 "http://www.parohiauceadesus.ro",
397 "http://parohiauceadesus.ro"
398 ]
399 }
400
401 /* 16 */
402 {
403 "_id" : "mx",
404 "count" : 1.0,
405 "domain" : [
406 "http://www.gelbukh.com"
407 ]
408 }
409
410 /* 17 */
411 {
412 "_id" : "unknown",
413 "count" : 1.0,
414 "domain" : [
415 "https://www.viveipcl.com"
416 ]
417 }
418
419 /* 18 */
420 {
421 "_id" : "bg",
422 "count" : 1.0,
423 "domain" : [
424 "http://anitra.net"
425 ]
426 }
427
428 /* 19 */
429 {
430 "_id" : "cn",
431 "count" : 1.0,
432 "domain" : [
433 "http://kiwi2china.com"
434 ]
435 }
436
437 /* 20 */
438 {
439 "_id" : "ir",
440 "count" : 1.0,
441 "domain" : [
442 "https://www.dideo.ir"
443 ]
444 }
445
446 /* 21 */
447 {
448 "_id" : "fi",
449 "count" : 1.0,
450 "domain" : [
451 "http://pertti.com"
452 ]
453 }
454
455 /* 22 */
456 {
457 "_id" : "ie",
458 "count" : 1.0,
459 "domain" : [
460 "https://coggle.it"
461 ]
462 }
463
464 /* 23 */
465 {
466 "_id" : "ru",
467 "count" : 1.0,
468 "domain" : [
469 "https://www.gismeteo.lv"
470 ]
471 }
472
473 /* 24 */
474 {
475 "_id" : "jp",
476 "count" : 1.0,
477 "domain" : [
478 "http://yutaka.it-n.jp"
479 ]
480 }
481


Can inspect a website's pages to judge whether its content is relevant or auto-translated as follows:
 db.getCollection('Webpages').find({URL:/svenkirsten.com/, mriSentenceCount: {$gt: 0}})
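
In /svenkirsten.com/ the dot is a regex wildcard, so the pattern is slightly broader than the literal domain name. When checking many domains this way, the pattern can be built from the domain string with the special characters escaped. A minimal sketch; escapeRegExp and domainPattern are hypothetical helper names, not part of the project code.

```javascript
// Escapes characters that have a special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Builds a pattern that matches the domain name literally, anywhere in a URL.
function domainPattern(domain) {
  return new RegExp(escapeRegExp(domain));
}

const literal = domainPattern("svenkirsten.com");
console.log(literal.test("http://svenkirsten.com/"));           // true
console.log(literal.test("http://svenkirstenXcom/"));            // false
console.log(/svenkirsten.com/.test("http://svenkirstenXcom/"));  // true: "." matched "X"
```

The resulting RegExp object could then be used as the URL value in the find() filter in place of the hand-written literal.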


* CN: Only 1/113 sites from CN stood out as being of interest: http://kiwi2china.com/
 BUT: it's auto-translated (e.g. Dutch is clearly auto-translated), MRI is not the default and not in any visible drop-down list, and the domain changes to https://nl.admission.nz/ once you view it in Dutch

* FR: 16 sites from FR
 http://blueheavenisland.com, http://www.blueheavenisland.com - misdetection. French Polynesia
 https://www.lexilogos.com/ -> takes me to NZ websites such as MaoriDictionary.co.nz for translating words anyway
 http://kihikihi.fr/ -> travel (blog?). Appears to be Hawaiian related and not Maori.
!! http://chantsdeluttes.free.fr/versionsinter/page%20maori.html -> Seems it may be a proper translation or composition, as the Dutch and Flemish (and Groningense) versions are different songs by individual translators/composers
 http://splaf.free.fr/pfurb.html - Tahiti, French Polynesian, ... island names
X http://mi.fitnessrebates.com - Uses https://wordpress.org/plugins/weglot/, a WordPress-compatible multilingual plugin, which ensures translated pages get indexed by Google - exactly what we want to avoid
 http://mahajana.net - misdetected a Japanese Zen Buddhist chant as MRI
 http://rapanui.fr - Rapa Nui, Easter Island. Misdetected.
 http://www.gif.ovh - autotranslated pages. Supposedly a GIF repository
 http://baladeornithologique.com - misdetection of the word "Retour"
 http://www.gaudry.be - misdetection of Japanese hiragana etc, and French "faire", as MRI
 http://www.gototahiti.net - probably misdetection, see title
 http://www.maraamusurfskirace.com - Bora Bora, French Polynesia. Misdetected.
 http://www.rongo-rongo.com - appears to be related to Easter Island. Just 1 sentence however.
 http://pt.city-usa.net - misdetection. Hawaii.
 https://www.manualscat.com - Misdetection. Appears to be in German. Manuals pages.
NL:
(!!!) - http://www.gouvernante.info and http://gouvernante.info - radio links to NZ websites not found by commoncrawl and which potentially have Maori language content. For example, http://irirangi.net/, https://www.atiawatoafm.com, www.maori.org.nz [http://www.gouvernante.info/radio4.htm]
- https://www.arrowhead.eu, https://arrowheadproject.azurewebsites.net, arrowhead.eu - misidentification of URL
- tonhut.nl - misidentification
? http://nielsonboutique.co.uk, http://longhornlaw.net, http://tetsubo.org, http://hidsonphoto.com, http://wearehomework.com/ - Feels autotranslated, but no language options visible. All SEO related
- diverosa.com - Rapa Nui, Easter Island
- nonlinear.demon.nl - misidentified
- encyclo.co.uk - misidentification
- henrifloor.nl - misidentification
- http://skimap.info/ - maps, NZ placenames in PDF
DK:
!! ++ http://akona.ngapuhitelevision.com, http://waiatarangatiratanga.ngapuhitelevision.com,
http://jazz.ngapuhitelevision.com, http://ngapuhitelevision.com, http://ngapuhiradio.com,
http://powhiri.ngapuhitelevision.com, http://komisch.ngapuhitelevision.com
- http://www.rennertweb.de - a photogallery page mentioning NZ placenames
CA:
- http://bcmarina.com AND http://bckayak.com - photos with Canadian placenames
- http://www.myrasplace.net - pages of photos, captions involving NZ placenames
~ http://00.gs/Maniapoto;Uriwera;Moriori;Hivaoa;Kumulipo.htm - Maori-Polynesian comparative dictionary words listing
- aguadilla.airport-authority.com - misidentification
- https://articles.imperialtometric.com - misidentification
- http://daandehn.com - no more than 1 sentence over multiple files. Appears to be photo captions of NZ placenames
DE:
- http://etymologie.info/~e/n_/nz-___reg.html - placenames, not meaningful
!! http://www.cartogiraffe.com and https://www.cartogiraffe.com - some genuine pages (Rarotongan), but one page is in Czech and had a single word misidentified as MRI
~ http://svenkirsten.com/ - one page mentions "tiki" but the rest is in English. The other is an (English) caption of "Book of Tiki A Maori Maiden"
- herocity - autotranslated
- weltderberge.de - 3 pages mention NZ mountains by name.
~ (arts.mythologica.fr) https://mythologica.fr/oceanie/texte/pantheon_polynesien.pdf - mentions certain Maori Gods and other Polynesian Gods by name.
- https://traynews.com - nothing in MRI, misdetected
~ http://klaaskoehne.de/galleries/nzl-taranaki/index.html - mentions NZ mountain names
- http://www.nierstrasz.org/deGrauwRegister.rtf - misdetected European (Dutch) names as MRI
X https://afrikhepri.org/mi/ - autotranslated
- https://www.tvteile.de - pure German pages, misdetected "Automatik" as a Maori language word
- etoile-de-lune.net - 5 pages containing 1 sentence each, but none with 2 sentences detected
- https://www.you-fly.com - misdetection of German "Warum?" as MRI
- http://vulkane.ch - misdetected pages on Hawaiian volcanoes.
- http://www.stephe.de - photos from NZ captioned with NZ placenames
- http://insecta.pro - misdetection
- http://m.distanta.1km.net - NZ placenames. Lots of distances mentioning Waitangi. Nothing detected as containing more than 1 sentence.
- https://ersatzteile-fachversand.de - German misdetected as Maori.
- https://laskar02cinta.page.tl/Info.htm - seems like a junk site with a random sentence autotranslated into many different languages. So one sentence is possibly in Maori, but may not make sense.
- http://www.behlig.de - misdetection. Photos from Hawaii.
!! http://www.udhr.de - Universal Declaration of Human Rights. (Also on a Bulgarian site). Multiple translations available.
- ITALY:
 http://oipaz.net/IMG/GalleriaAotearoa/ - NZ photogallery with each photo captioned by placename
 http://www.marcosanti.it/Reportage/Oceania_ph/Nuova_Zelanda/ - each photo captioned by NZ placename
 http://www.pegasoesmicamion.com/ - REO abbreviation misidentified, also in REO%20PUBLICIDAD.htm
- AUSTRIA:
 petit-prince.at - Tahitian and Wayuu (Venezuela) translations of Le Petit Prince
 http://www.tmtmm.net/newzealand - photos from NZ named after places and people's names
- ROMANIA: parohiauceadesus.ro - Sentences of single Romanian words misidentified.
- ISRAEL:
 http://www.daat.ac.il - misidentification of "no." as MRI, and Hebrew words.
 https://www.hitiaotera.com/ - misidentification of Tahitian pages
- RUSSIA: https://www.gismeteo.lv - misidentification of an email address
- JAPAN: http://yutaka.it-n.jp - many pages of scientific names of (plants?) which are often misdetected as MRI
!! - IRELAND, ie: https://coggle.it
- IRAN: https://www.dideo.ir/v/yt/d6cgya0ze-E - video title from MaoriTelevision website
- CZECH REPUBLIC:
? https://www.fipojobs.com/new-zealand/jobs-work-p-1 - NZ job position title in MRI but rest in English
!! http://www.henryklahola.nazory.cz/094.Maori.htm and http://henryklahola.nazory.cz variant
 http://about.ilikeyou.com - dating site. Misidentification.
- SPAIN:
!! https://www.uv.es/~pla/red.net/intmaori.html
 https://www.reclamaciondevuelos.com - 2 occurrences of the word "kiwi"
 http://www.info-hoteles.com/nz/2/hotels_lake_rotoiti.asp - 2 uses of the same placename
 http://www.cruceros-princess.mx/princessMX/Oferta_Cruzeiros_Polinesia.html - Polynesian placenames
- SINGAPORE: https://omg-solutions.com - autotranslated
- TURKEY: https://www.elitedeluxe.com.tr/mi/yatak-odasi-takimlari - autotranslated
- MEXICO: http://www.gelbukh.com - misidentification, lines of just numbers or phrases like "Area Chair" in English and Russian CVs.
- FINLAND: http://pertti.com - travelogue, placenames
- SWITZERLAND CH:
 nicoledidi.ch - blog, placenames
 https://photos.axelebert.org - Tahiti related content
- UNKNOWN: https://www.viveipcl.com: tours website, placenames mentioned
#- EU: https://www.the-good-stuff-factory.be/mi/ : Autotranslated
!! - BULGARIA: http://anitra.net/activism/humanrights/UDHR/rrt_print.htm (2 pages)


AUSTRALIA:
!! https://www.kiwiproperty.com - e.g. https://www.kiwiproperty.com/the-base/mi/he-paepaki/ has some actual MRI sentences. [Not autotranslated]
? http://fionajack.net - Wellington gallery of artist. A few occurrences of Kia Ora in a title-like context (e.g. "Street Party Kia Ora! Kia Ora!")
X!! https://infogram.com/te-marautanga-o-aotearoa-moe-pld-allocations-2012-1go502ygvn562jd - site of individual pages (like docs.google.com). This one has a relevant infogram image. But it's English with MRI in the image legend and captions.
!! https://koreromaori.com - some actual Maori language sentences
 http://theunderwaterworld.com/Galleries/Roimata/roimata-frame.html - placenames

UK:
 http://www.wordsearchfun.com/200628_Word_Find_wordsearch.html - 2 word games with Maori words (one of them has 3 different views, e.g. print view)
? https://omniatlas.com/maps/australasia/18400206/plain/ - historical map with Maori iwi names over NZ map regions
? https://omniatlas.com/maps/australasia/18400206/ - historical map of Australia and NZ at the time of the Treaty of Waitangi, with events marked in English
 https://centrallanguageschool.com - AUTOTRANSLATED
 https://www.solasolv.com - Autotranslated product site
 http://mikestephens.co.uk/ - photo captions containing NZ placenames
 http://www.woolrych.org/nzholiday2004/ - photogallery captioned with NZ placenames


US:
Done: manually inspected 68/117 sites

TOTAL US: 4+7+7+4+3=25

DEFINITELY:
+ http://anglicanhistory.org,
+ http://www.unicode.org, [Universal Declaration of Human Rights]
+ https://static-promote.weebly.com,
+ http://aclhokiangarocks.blogspot.com, [often English, but COMMUNITY. At least short or partial MRI sentences.]

BIBLE/MOHAMMED/BAHAI TRANSLATIONS probably not auto translations:
+ http://bahaiprayers.net, [Dutch seems to be properly translated, not auto-translated, so maybe MRI too]
+ https://biblehub.com,
+ http://www.muhammad.com, [possibly not autotranslated]
+ http://www.godrules.net, [possibly not autotranslated]
+ http://m.biblepub.com,
+ http://www.krassotkin.ru, [probably real translations, as there are multiple Dutch translations from different sources provided]
+ http://www.gotquestions.org, [doesn't appear autotranslated]
X https://ebible.org, [Hiri Motu, PNG language misdetected. Doesn't seem to have Maori]
X https://www.bible.com, doesn't have Maori translation. Misdetected.
X https://wol.jw.org, - doesn't have Maori translations. Instead, Rongo-rongo, Kiribati (Micronesian) etc misdetected
X https://png.bible, [misdetected, Papua New Guinea]
X http://www.precious-testimonies.com, http://precious-testimonies.com/JesusDidItTranslations/JesusDidItMaoriTranslation.htm may be autotranslated as the Dutch page looks more like Danish or some Scandinavian language and the French page is missing accented characters.

CHECK, PROBABLY - PROCESSED:
!! https://maorinews.com,
!! http://maaori.com,
!!+ http://kiaorahola.blogspot.com,
+ https://kjohnsonnz.blogspot.com,
+ http://pumanawawhangara.blogspot.com,
+ http://dannykahei.tripod.com,
+ http://burkekm001.tripod.com,
+ http://tkkpipipaopao.blogspot.com,
+ http://manateina.blogspot.com,
? tkkpipipaopao.blogspot.com? http://rangiwewehi.com, [English, but community]
? https://www.terakau.org, [COMMUNITY, but English]
? https://www.pipirikiapapatuanuku.org, [COMMUNITY?, in English, environment site]
~ http://georgegi.tripod.com,
~ http://ngarangatahi.tripod.com, [1 page, image caption, Maori language warden position title with English sentence for appointment as warden]
X http://fhr.kiwicelts.com,
X http://tkrow.tripod.com, [English, background of NZ place]
X http://www.mkiwi.com, - placenames
X http://www.waimate.com, [English, NZ place]

MAYBE, INSPECT - PROCESSED:
? https://www.natekore2018.com, [lots of English, but COMMUNITY, CULTURE]
+ http://tatai09.blogspot.com,
+ http://www.twttoa.com,
+ http://tuhua2010.blogspot.com,
X http://www.huapala.org, [misdetected, Hawaiian]
X https://www.vaihaunui.net, [misdetected, Tahiti]
X https://www.kaifineart.com, [art site by different artists. A Chinese and another (possibly Japanese) name were misdetected]
X http://mahoraroom8.blogspot.com, [NZ school, but main page mostly in English. No pages with > 1 sentence detected as MRI]
+ http://piripi.blogspot.com,
X http://www.hiroa.pf, [misdetected. Crawled content appears Polynesian not Maori]
X http://korora.econ.yale.edu, [NZ place photo caption]
X https://www.poehalisnami.ua, [mostly Cyrillic, with some NZ or Polynesian names misdetected]
X http://hannas-reiseblog.blogspot.com - one page contained NZ placenames, another had a word misdetected


+ https://www.breaker.audio, [audio, with occasional English.]
? https://livestream.com, [video and audio, seems in English, but maybe CULTURAL/COMMUNITY?]

X https://docs.google.com, timetable with occasional Maori language word
+ https://drive.google.com, https://drive.google.com/file/d/1NwuzafjddaP8gxI7O_Zapts5bM7mrtwn/preview is an image of Maori number names. But the other page on drive.google.com is a NZ certificate or ID (in English) of a person's position.
http://ritusehji.blogspot.com - no page with more than 1 sentence detected. But short string of actual MRI content. Educator blog with pictures and English language content.


PINTEREST
+ https://in.pinterest.com/pin/317363104978423418/
 "karakia mo te moana - Google Search | Te Reo Maori Resources | Moana, Powerpoint tips, Google"
? https://za.pinterest.com/pin/524669425310419500/
 Maori Moko | Image | Moko Maori Tattoo & Portraits | TA MOKO | Maori tribe, Maori people, Maori art [COMMUNITY, CULTURE]
[The other pinterest sites detected as numPagesContainingMRI > 0 were misdetected]

https://nl.pinterest.com,
https://www.pinterest.jp,
https://www.pinterest.it,
https://www.pinterest.co.uk,
https://www.pinterest.ca,
https://za.pinterest.com,
https://www.pinterest.fr,
https://in.pinterest.com,

MORE BLOGSPOTS
X http://word-dialect.blogspot.com, [Indonesian, misdetected]
~ http://atopeconlostopes.blogspot.com, [title on page appears to be in MRI, but content appears to be in English and South/Central American. Internationally focussed content.]
X http://lianzaconference2012.blogspot.com, [NZ placename or institution]
? http://mrshamiltonskoolkidz.blogspot.com, [te reo Maori related school activities. Described in English.]
X http://capsuraotearoa.blogspot.com, [blog in French, photo captions contain NZ placenames]
X http://blogdepasopor.blogspot.com, [blog in French, Rapa Nui/Easter Island related content, misdetected.]


UNLIKELY
?? http://naturalfatburner.net, http://naturalfatburner.net/NoNonsenseTed/fatloss-mao/ feels like it's autotranslated, an image of text appears, but the text is in MRI [advertising for some weight loss gimmick]


BLACKLIST:
X http://ww25.milfsplease.com,
X http://www.the-naked.com

OTHER:
X http://seapixonline.com, https://www.seapixonline.com, [photo captions of ships. Sometimes misdetected Japanese words as MRI.]
X http://www.code-postal.com, https://www.code-postal.com, [not more than 1 sentence detected as in MRI]
X https://www.dbnames.net, [Name database, lots misdetected]

STILL TO DO LIST - PROCESSED:

X https://www.myadsclassified.com, [misdetected 3 short English sentences as MRI]
X http://www.whoisthatr.com, [misdetected short English sentence as MRI]
X https://www.oemsec.com, [autotranslated product site]
X http://svenskadress.net, [linkfarm-like site of related junk links, contained URLs misdetected as MRI]

X https://www.webwiki.com, [contains URLs. URLs containing Aotearoa as substring detected as MRI. But no proper sentence content.]
X http://mikebonnice.com, [Hawaiian and Tahiti related content misdetected]
X http://www.hudl.com, [misdetected short English sentence as MRI]
X http://www.wikitree.com, [misdetected short English sentence as MRI]
X http://shuttersportnelson.photoshelter.com, [image captions of "Wairua Warrior"]

X http://niken8media.logdown.com, [Poker website? Looks autotranslated or Lorem Ipsum type of meaningless sentences.]
X https://www.podrozeady.com, Looks Polish or other East-European language. The NZ page https://www.podrozeady.com/NZ/4/ had placenames detected.

X http://www.thesalmons.org, [detection and misdetection of author names of papers hosted]

X http://linkvip.top, [.rar and media file links misdetected as MRI]


X http://www.lunar-occultations.com, [NZ place names for astronomical phenomena]
X http://shangrilapress.net, [NZ placenames]
X http://malecek.com, [misdetection of CD title]
X https://www.blue-frontiers.com, [Tahitian, Reo Tahiti, misdetected as MRI]
X http://www.whoisentry.com, [URL names, looked at several which were probably misdetected as MRI]
X http://loquevendra318.com, [uses Google translate for auto-translation]


?? http://www.forensicfashion.com, [historical information, useful for CULTURE? e.g. http://www.forensicfashion.com/1807MaoriChief.html]

X http://www.eyecontactsite.com, [Lots of names. And a few short sentences or words possibly in comments.]
X http://eartheum.com, [Rapa Nui, Easter Island related content. Misdetected]
X http://www.steve-wheeler.co.uk, [Blogspot. Title of a single page is in Maori. "Aotearoa ... kei te aroha au ki a koe"]
X https://chromium.googlesource.com, [some source code related to languages' two letter codes]

X http://www.roadsmile.com, [Lots of misdetection based on the word Kia.]
?? https://www.knowatom.com, https://phet.colorado.edu [Similar looking science web sites for children. Uses auto-translation?]

X https://www.indexmundi.com, [place names. Pages about Solomon Islands. Misdetection of placenames.]



X http://wowwars.net, [Has a page on Kia Kaha meaning, but URL redirects to a different low quality site with bad formatting and adverts.]
?? https://www.hidroponia.org.mx, [Not sure if https://www.hidroponia.org.mx/index.php/idiomas/284-hydroponics-te-ahurea-wai-maori is autotranslated or not. Can't easily locate existence of Dutch or German translated pages. There's Tamil-Singapore, but no other Tamil. So maybe translations based on target buyer audience?]
X http://www.v3whois.com, [URLs are misdetected as MRI]
X http://rhymebrain.com, [appears to have misdetected a short phrase of 2 words, Kai Kaia, besides phrase words from other languages]


X SINGLE SENTENCE DETECTED (NO MORE, AND NOT WHOLE PAGE isMRI):
 http://frontrowphotos.com,
 http://www.pressreader.com,
 https://www.nccri.ie,
 http://takethatvacation.com,
 http://worldradiomap.com,
 http://www.namesdir.com,

 X http://www.frogsonline.com, [NZ hotels, placenames]
 X http://www.geni.com, [Single sentence misdetection]
 X http://wikiedit.org, [just a list of lots of words, possibly placenames. Some misdetected, e.g. Rapa Nui]



TOTALS:
US: 25
AU: 2
DE: 2
DK: 2
BG: 1
CZ: 1
ES: 1
FR: 1
IE: 1
TOTAL: 213

------------------------------------------------
7882. Need to inspect all those sites with any webPAGE that has mi in its URL path (mi.* or */mi) that are not sites with nz TLD or originating in NZ:
789
db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}, {domain: {$not: /.nz$/}}, {geoLocationCountryCode: {$ne: "NZ"}}]}).count()
= 472

(vs. the same filter on numPagesInMRI instead of numPagesContainingMRI:
db.getCollection('Websites').find({$and: [{numPagesInMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}, {domain: {$not: /.nz$/}}, {geoLocationCountryCode: {$ne: "NZ"}}]}).count()
= 209)
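
The two counts above differ only in which page-count field is tested. A minimal sketch of a helper that builds the filter for either field follows. The function name overseasMiFilter is hypothetical (it is not part of the original notes). The sketch also escapes the dot in the .nz regex, which is an assumption that should not change which domains match here; the counts have not been re-run with it.

```javascript
// Hypothetical helper (not in the original notes): builds the filter for
// overseas sites that have "mi" in a URL path, for a given count field.
function overseasMiFilter(countField) {
    var filter = {
        urlContainsLangCodeInPath: true,
        domain: {$not: /\.nz$/},              // assumption: dot escaped, unlike /.nz$/ above
        geoLocationCountryCode: {$ne: "NZ"}
    };
    filter[countField] = {$gt: 0};            // e.g. numPagesContainingMRI or numPagesInMRI
    return filter;
}

// Listing several fields in one filter document is an implicit AND, so this
// has the same effect as the explicit $and form.
// In the mongo shell (needs the Websites collection):
//   db.getCollection('Websites').find(overseasMiFilter("numPagesContainingMRI")).count()
//   db.getCollection('Websites').find(overseasMiFilter("numPagesInMRI")).count()
```

Building the filter in one place makes it harder for the two queries to drift apart when one of the conditions is edited.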


db.Websites.aggregate([
    {
        $match: {
            $and: [{numPagesContainingMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}, {domain: {$not: /.nz$/}}, {geoLocationCountryCode: {$ne: "NZ"}}]
        }
    },
    {$group: {_id: "$geoLocationCountryCode", count: {$sum: 1}, domain: { $addToSet: '$domain' }}},
    { $sort : { count : -1} }
])

Also excluding AU, since we dealt with that already in step A1:
db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}, {domain: {$not: /.nz$/}}, {geoLocationCountryCode: {$not: /NZ|AU/}}]}).count()
= 471
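
Note that /NZ|AU/ is an unanchored regex, so it would also exclude any country code that merely contains NZ or AU. For the two-letter codes recorded here that should make no difference, but an exact-match variant using $nin is sketched below. The function name countryBreakdownPipeline and its parameter are hypothetical, the escaped dot in the .nz regex is an assumption, and the pipeline has not been re-run against the database.

```javascript
// Hypothetical helper: per-country breakdown of overseas sites with "mi" in a
// URL path, skipping an exact list of country codes.
function countryBreakdownPipeline(excludedCountries) {
    return [
        {$match: {
            numPagesContainingMRI: {$gt: 0},
            urlContainsLangCodeInPath: true,
            domain: {$not: /\.nz$/},                           // assumption: dot escaped
            geoLocationCountryCode: {$nin: excludedCountries}  // exact matches only
        }},
        {$group: {_id: "$geoLocationCountryCode", count: {$sum: 1}, domain: {$addToSet: "$domain"}}},
        {$sort: {count: -1}}
    ];
}

// In the mongo shell:
//   db.Websites.aggregate(countryBreakdownPipeline(["NZ", "AU"]))
```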

db.Websites.aggregate([
    {
        $match: {
            $and: [{numPagesContainingMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}, {domain: {$not: /.nz$/}}, {geoLocationCountryCode: {$not: /NZ|AU/}}]
        }
    },
    {$group: {_id: "$geoLocationCountryCode", count: {$sum: 1}, domain: { $addToSet: '$domain' }}},
    { $sort : { count : -1} }
])


823 /* 1 */
824 {
825 "_id" : "US",
826 "count" : 305.0,
827 "domain" : [
828 "http://www.qjfiberglass.com",
829 "http://www.sokenswitch.com",
830 "https://follow3rs.com",
831 "http://www.zjnbzy.com",
832 "http://www.nbbvc.com",
833 "https://mamaclub.info",
834 "https://www.nbwinwinea.com",
835 "https://www.sdspraybooth.com",
836 "https://www.jlextract.com",
837 "http://www.artmetalcn.com",
838 "http://www.steel-in-china.com",
839 "http://www.sunnymaycn.com",
840 "https://www.weld-automation.com",
841 "http://www.homewin88.com",
842 "http://indigenousblogs.com",
843 "https://biblia.gospelprime.com.br",
844 "http://www.jhc-nonwoven.com",
845 "http://www.forever-moving.com",
846 "http://www.eternal-friendship.com",
847 "http://www.conele-mixer.com",
848 "http://binaryoptionsindicators.com",
849 "http://www.mao-shuo.com",
850 "http://www.bmaxmachine.com",
851 "http://www.sdxhhd.com",
852 "http://lingeriefc.com",
853 "http://infomutt.com",
854 "https://jobdescriptionsample.org",
855 "http://mi.hongwugas.com",
856 "http://www.lishin.cc",
857 "http://www.mksmartcard.com",
858 "http://www.toption-ingredients.com",
859 "http://www.sps-squeegee.com",
860 "https://facebook.roseconverter.com",
861 "https://www.bestpvcfence.com",
862 "http://www.tubemillcn.com",
863 "http://www.jindunlaobao.com",
864 "http://www.wavesspring.com",
865 "http://www.restart-industry.com",
866 "https://mi.nyecountdown.com",
867 "http://www.ycautoc.com",
868 "http://www.htwindsolarpower.com",
869 "http://www.joyseaplywood.com",
870 "http://www.teda-hydraulic.com",
871 "http://www.gmk-valve.com",
872 "https://usahello.org",
873 "https://www.datemypet.com",
874 "https://worldstarhiphop.roseconverter.com",
875 "http://www.sdtzgloves.com",
876 "https://www.airpullfilter.com",
877 "http://www.bestwaytowhitenteethguide.org",
878 "https://mi.m.wikipedia.org",
879 "http://www.analiabriz.com",
880 "http://www.xida-electronics.com",
881 "http://www.weldpipemill.com",
882 "http://www.ictctruss.com",
883 "https://www.junschem.com",
884 "https://www.judinwire.net",
885 "http://www.mtpak.com",
886 "http://www.nantaidiesel.com",
887 "https://www.nbkeming.com",
888 "http://www.windsolarchina.com",
889 "http://www.gormeet.com",
890 "http://www.wf-fastener.com",
891 "http://www.pressurelantern.com",
892 "https://www.drickinstruments.com",
893 "http://milfsplease.com",
894 "http://www.risepipe.com",
895 "https://www.yourcloudlibrary.com",
896 "http://portal.smart-project.info",
897 "http://cdn.centrallanguageschool.com",
898 "http://www.shengxinsport.com",
899 "https://www.tianjia-lock.com",
900 "http://www.julongjewelry.cn",
901 "https://www.yogemcasting.com",
902 "http://www.chinaocan.com",
903 "http://www.autosunsoul.com",
904 "https://www.prostepper.com",
905 "https://www.pldyes.com",
906 "http://www.nicerelay.com",
907 "https://www.sinodryair.com",
908 "https://www.risenltd.com",
909 "http://www.albertnovosino.com",
910 "http://www.ttyzfilter.com",
911 "http://www.bst-elecs.com",
912 "http://www.hqftex.com",
913 "http://www.kehengmixing.com",
914 "http://loginmail.online",
915 "http://www.bluekin.com",
916 "https://blondewebcamgirl.com",
917 "https://www.nickel-alloy.net",
918 "https://www.hjfoodmachinery.com",
919 "http://csunplugged.org",
920 "https://www.inpnurseryproducts.com",
921 "http://www.americasportsfloor.com",
922 "http://www.dmdryer.com",
923 "http://www.sindadisplay.com",
924 "http://www.focusway-casting.com",
925 "https://mi.centr-zashity.ru",
926 "https://www.axnewdisplay.com",
927 "https://cycletraderpro.com",
928 "https://vk.roseconverter.com",
929 "https://www.livehoster.com",
930 "http://www.cnsongben.com",
931 "http://www.mytrickstips.com",
932 "http://www.quickcncmachine.com",
933 "http://www.arjextrailerparts.com",
934 "http://www.shshenyong.com",
935 "http://www.pvcroofingtile.com",
936 "http://www.wrdtubemill.com",
937 "http://church-of-christ.org",
938 "https://www.td-casting.com",
939 "https://www.hengweihoseclamp.com",
940 "http://www.wpcline.com",
941 "https://www.kubbamachine.com",
942 "http://www.goethe.de",
943 "https://www.njkeyuda.com",
944 "http://www.prostepper.com",
945 "http://www.cnfeinade.com",
946 "http://www.huamachinery.com",
947 "http://www.damiser.com",
948 "http://www.shanghailangzhiweld.com",
949 "http://www.fanhaopets.com",
950 "https://blockchains.io",
951 "http://www.inpnurseryproducts.com",
952 "http://www.yixinhetrade.com",
953 "http://www.newbaoquan.com",
954 "https://mi.lawyers.cafe",
955 "http://www.shenhe-bearing.com",
956 "http://atoall.com",
957 "http://www.vango-tech.com",
958 "https://www.gigalight.com",
959 "http://www.ladybagcn.com",
960 "http://www.tjcywires.com",
961 "http://www.vigor-industry.com",
962 "http://www.litbright-candles.com",
963 "http://www.nide-industry.com",
964 "http://www.cnfreda.com",
965 "http://www.jbpcba.com.cn",
966 "http://www.qitai-adhesive.com",
967 "http://www.weld-automation.com",
968 "http://www.cnyaonan.com",
969 "http://www.ruifeng-leather.com",
970 "http://www-hotmail-com.email",
971 "http://www.jointcontrols.net",
972 "https://twitter.roseconverter.com",
973 "https://www.aquagem.com.cn",
974 "http://www.seasum.cn",
975 "http://www.steelprotectionpack.com",
976 "http://www.suoxuehuwai.com",
977 "http://www.sunshinebelt.com",
978 "http://www.nyforgedwheels.com",
979 "http://www.amcbox.com",
980 "http://www.livepro-beauty.com",
981 "http://www.nbyobo.com",
982 "http://www.chinacarbonfibre.com",
983 "https://guidebooq.com",
984 "https://www.hello4x4.com",
985 "http://www.zhonghe222.com",
986 "http://www.church-of-christ.org",
987 "https://www.czzhit.com",
988 "https://www.king-pcb.com",
989 "http://www.secondhormone.com",
990 "http://www.sxceramic.com",
991 "http://www.hobbycarbon.com",
992 "http://www.bdknitting.com",
993 "http://www.ntvigourbrush.com",
994 "http://www.china-brewhouse.com",
995 "http://mi.tccasdic.com",
996 "http://www.hzhinew.com",
997 "http://www.silicone-odm.com",
998 "http://www.liweimetal.com",
999 "http://www.huaxinfurnace.com",
1000 "http://www.envicool.net",
1001 "http://www.cnxh-electric.com",
1002 "http://www.jiejingfactory.com",
1003 "http://www.longda-inc.com",
1004 "http://www.pamaens.com",
1005 "http://www.sdcncrouter.com",
1006 "http://www.tkfanen.com",
1007 "http://www.touchdisplays-tech.com",
1008 "http://www.twtvalvecn.com",
1009 "http://www.weddingfurniture.com",
1010 "https://www.huadongmedical.com",
1011 "http://www.ledecofr.com",
1012 "http://www.rosin-kings.com",
1013 "http://www.aluminum-profiles-supplier.com",
1014 "http://www.cannapresso.com",
1015 "https://www.cz-juteng.com",
1016 "http://www.strongsaw.com",
1017 "http://jobdescriptionsample.org",
1018 "http://www.btmeac.com",
1019 "http://www.nicehut-window.com",
1020 "http://www.accotech.net",
1021 "https://www.dshprecision.com",
1022 "http://www.gemnice.com",
1023 "http://www.richina-tools.com",
1024 "http://www.brushcutterjusen.com",
1025 "http://www.szhaiwang.com",
1026 "https://www.conele-mixer.com",
1027 "https://www.tkthvac.com",
1028 "http://technobuzzer.com",
1029 "https://www.csunplugged.org",
1030 "http://www.ainuogas.com",
1031 "https://policies.oclc.org",
1032 "http://www.xfinsulation.com",
1033 "http://www.lanlinprintech.com",
1034 "http://www.yrkseal.com",
1035 "http://www.jpslurrypump.com",
1036 "http://www.soontruepackaging.com",
1037 "http://www.shengrunqiche.com",
1038 "http://www.luluae.com",
1039 "https://www.judipak.com",
1040 "http://www.cz-juteng.com",
1041 "http://www.jiajiebathmirror.com",
1042 "http://www.bigrollscloth.com",
1043 "http://www.chinatopcnc.com",
1044 "https://drugsinc.eu",
1045 "http://www.wosaicabinet.com",
1046 "http://www.wellfit-sportswear.com",
1047 "http://www.pxbaisheng.com",
1048 "http://www.meihua-wm.com",
1049 "http://www.wzdongyi.com",
1050 "http://www.kd-physicalrehab.com",
1051 "http://www.longs-motor.com",
1052 "https://www.samsungwiremesh.com",
1053 "http://www.wellformpacking.com",
1054 "http://www.hs-stationery.com",
1055 "http://www.allutertech.com",
1056 "http://www.czzhit.com",
1057 "http://www.jlgrating.com",
1058 "http://www.qbd-group.com",
1059 "http://www.evaescort.net",
1060 "https://dwsolo.com",
1061 "http://www.chuamotor.com",
1062 "http://www.ksdoing.com",
1063 "http://mi.broadcastbeat.com",
1064 "http://www.czldfloor.com",
1065 "http://www.qypaperbox.com",
1066 "https://mi.wikipedia.org",
1067 "http://www.houshenshoes.com",
1068 "http://www.xzc9.com",
1069 "http://www.chinacombinerbox.com",
1070 "https://www.everfineplastics.com",
1071 "http://www.sinemagnetic.com",
1072 "http://www.linphos.com",
1073 "https://www.rikoooo.com",
1074 "http://www.ncpcpharma.com",
1075 "http://www.evergrowingcage.com",
1076 "http://www.qxmic.com",
1077 "https://www.fxcc.com",
1078 "http://www.ldsolarpv.com",
1079 "http://mytrickstips.com",
1080 "http://www.linbaymachinery.com",
1081 "http://www.photoprofix.com",
1082 "http://www.supplyfurniture.com",
1083 "http://www.honglu-mining.com",
1084 "http://www.szebo.com",
1085 "http://www.cnrgxy.com",
1086 "http://blicanada.net",
1087 "http://www.homey-tec.com",
1088 "http://www.whties.com",
1089 "http://www.zhenchengscrew.com",
1090 "http://www.ruk-tech.com",
1091 "http://www.longxin-global.com",
1092 "https://www.tymexnetting.com",
1093 "http://www.chinabosun.com",
1094 "http://www.b-packaging.com",
1095 "http://www.ncpcvet.com",
1096 "https://mi.kidspicturedictionary.com",
1097 "http://mi.guoguangelectric.com",
1098 "http://topbitcoincard.com",
1099 "https://atoall.com",
1100 "http://www.acouplefortheroad.com",
1101 "http://www.tongyujiaju.com",
1102 "http://www.chinapipemills.com",
1103 "http://www.infomutt.com",
1104 "http://www.fxctool.com",
1105 "http://www.samewe.net",
1106 "https://www.aquark.com.cn",
1107 "https://www.artiegarden.com",
1108 "http://www.fxpremiere.com",
1109 "http://www.sog-pump.com",
1110 "http://www.omnicnc.com",
1111 "https://www.waterproof-factory.com",
1112 "http://www.wanmaroto.com",
1113 "http://mi.gmpmetalwork.com",
1114 "https://www.webhostingsecretrevealed.net",
1115 "http://www.gecko-kalimba.com",
1116 "https://www.glorystarlaser.com",
1117 "http://www.viairdoormat.com",
1118 "https://vimeo.roseconverter.com",
1119 "https://www.fctele.com",
1120 "http://www.hzzjair.com",
1121 "https://2fish.co",
1122 "http://www.qymachines.com",
1123 "http://www.chinachairtable.com",
1124 "http://www.gfh-electric.com",
1125 "http://www.tangres100.com",
1126 "https://www.valve-pipe-fitting.com",
1127 "http://www.fancyco.com",
1128 "http://www.zhengmaoelec.com",
1129 "http://www.chinagxmy.com",
1130 "https://www.tjshenzhoutong.com",
1131 "https://maxspeedtest.com"
1132 ]
1133 }
1134
1135 /* 2 */
1136 {
1137 "_id" : "CN",
1138 "count" : 113.0,
1139 "domain" : [
1140 "https://www.fibereye2.com",
1141 "https://www.outstandingdm.com",
1142 "https://www.szradiant.com",
1143 "http://www.gmmdjx.com",
1144 "http://www.likvchina.com",
1145 "https://www.abdindustrial.com",
1146 "https://www.c-superun.com",
1147 "https://www.slagremoving.com",
1148 "https://www.sino-masterbatch.com",
1149 "http://www.cntiescarf.com",
1150 "https://www.dm-compressor.com",
1151 "https://www.szhtpmart.com",
1152 "https://www.phhydraulic.com",
1153 "https://www.imposalight.com",
1154 "https://www.medke.com",
1155 "http://www.eburn-burner.com",
1156 "https://www.haitungchem.com",
1157 "http://www.medicohongkong.com",
1158 "http://www.koowheel.com",
1159 "https://www.aerial-display.com",
1160 "https://www.cntfsolar.com",
1161 "https://www.aoxinhvacr.com",
1162 "https://www.diamante-tech.com",
1163 "https://www.richest-group.com",
1164 "http://www.world-starter.com",
1165 "http://www.goldenlaser.cc",
1166 "https://www.km-medicine.com",
1167 "https://www.safesworld.com",
1168 "https://www.peptidejymed.com",
1169 "https://www.nbhengchen.com",
1170 "https://www.xinyuesteel.com",
1171 "https://www.charmingmetal.com",
1172 "https://www.lasonparts.com",
1173 "https://www.ngyc.com",
1174 "https://www.pacopower.com",
1175 "https://www.tjtgsteel.com",
1176 "http://www.abdindustrial.com",
1177 "https://www.yangrutingtrade.com",
1178 "http://www.wedacdisplays.com",
1179 "https://www.gaofeng-petro.com",
1180 "https://www.ez-walk.com",
1181 "https://www.szzhsbag.com",
1182 "https://www.simphoenix.com",
1183 "http://www.focuslasersystems.com",
1184 "https://www.fc-med.com",
1185 "http://www.zypackag.com",
1186 "http://www.kavounautoparts.com",
1187 "https://www.foocles.com",
1188 "https://www.jsjlmachinery.com",
1189 "https://www.special-metal.com",
1190 "https://www.bestardoors.com",
1191 "http://www.wenwencf.com",
1192 "https://www.insharevape.com",
1193 "https://www.dghk-buffer.com",
1194 "https://www.n2o2gas.com",
1195 "https://www.changjia-machinery.com",
1196 "https://www.nfyo.com",
1197 "http://www.estarspareparts.com",
1198 "https://www.jsbotanics.com",
1199 "https://www.chinarfidcard.com",
1200 "https://www.sjzhgw.com",
1201 "https://www.study-mandarin.com",
1202 "https://www.qdruidetai.com",
1203 "https://www.zhongxinlighting.com",
1204 "http://www.qjqdvalve.com",
1205 "https://www.painting-machine.com",
1206 "https://www.bescatray.com",
1207 "https://www.tianseoffice.com",
1208 "https://www.herbal-ingredients.com",
1209 "https://www.qlart.com",
1210 "https://www.sehenda-en.com",
1211 "https://www.egbadges.com",
1212 "http://www.eudemonbaby.com",
1213 "http://www.3drambery.com",
1214 "https://www.chinawelken.com",
1215 "http://www.jsbotanics.com",
1216 "https://www.rswires.com",
1217 "https://www.zjyongqi.com",
1218 "https://www.micropreparedslides.com",
1219 "http://www.longtopmining.com",
1220 "https://www.rykay.com",
1221 "https://www.sdtoplit.com",
1222 "https://www.wecare-life.com",
1223 "http://www.wigglewires.com",
1224 "https://www.grandstarcn.com",
1225 "https://www.bailixin.com",
1226 "http://www.refinehotelsupply.com",
1227 "http://www.prius-automatic.com",
1228 "https://www.nbulboy.com",
1229 "https://www.jy-glass.com",
1230 "http://www.ankaicnc.com",
1231 "https://www.band-ss.com",
1232 "https://www.hytokstech.com",
1233 "https://www.goldnard.com",
1234 "http://www.comfortebicycle.com",
1235 "https://www.zengrit.com",
1236 "https://www.3drambery.com",
1237 "https://www.pakite.com",
1238 "https://www.xianglin-plastics.com",
1239 "https://www.inductorchina.com",
1240 "https://www.nbjiatong.com",
1241 "https://www.bofanpc.com",
1242 "https://www.sakysteel.com",
1243 "http://www.coneleqd.com",
1244 "https://www.jewellrylove.com",
1245 "http://www.nbwellrun.com",
1246 "http://www.yulong-cellulose-cmc.com",
1247 "https://www.aootan.com",
1248 "https://www.coffbrewing.com",
1249 "http://www.jetwayamenities.com",
1250 "https://english.taiergroup.com",
1251 "http://www.czhengfa.com",
1252 "https://www.sitzonechair.com"
1253 ]
1254 }
1255
1256 /* 3 */
1257 {
1258 "_id" : "FR",
1259 "count" : 19.0,
1260 "domain" : [
1261 "https://www.slotsltd.com",
1262 "https://mi.apicmo.com",
1263 "http://mi.psychicbonus.com",
1264 "http://mi.aasraw.com",
1265 "https://mi.hyperbaric-chamber.com",
1266 "https://mi.usa-casino-online.com",
1267 "https://mi.gem.agency",
1268 "https://mi.hghphuket.com",
1269 "https://mi.mehmetdursun.av.tr",
1270 "https://mi.mhthread.com",
1271 "https://mi.phcoker.com",
1272 "https://www.casino.uk.com",
1273 "https://www.planetkeyboard.com",
1274 "http://mi.outboard-boat-motor-repair.com",
1275 "http://www.gpedia.com",
1276 "http://mi.fitnessrebates.com",
1277 "https://www.expresscasino.com",
1278 "https://mi.petrpikora.com",
1279 "https://mi.isearch.de"
1280 ]
1281 }
1282
1283 /* 4 */
1284 {
1285 "_id" : "DE",
1286 "count" : 8.0,
1287 "domain" : [
1288 "https://herocity.de",
1289 "https://traynews.com",
1290 "http://www.almancax.com",
1291 "https://transposh.org",
1292 "http://transposh.org",
1293 "https://mi.vessoft.com",
1294 "https://www.saper-link-news.com",
1295 "https://afrikhepri.org"
1296 ]
1297 }
1298
1299 /* 5 */
1300 {
1301 "_id" : "NL",
1302 "count" : 6.0,
1303 "domain" : [
1304 "http://www.martinvrijland.nl",
1305 "https://realtytenerife.com",
1306 "https://www.bitbybitbook.com",
1307 "https://www.emergency-live.com",
1308 "http://www.cbdolievoordelen.nl",
1309 "http://www.spectrumschool.be"
1310 ]
1311 }
1312
1313 /* 6 */
1314 {
1315 "_id" : "CA",
1316 "count" : 5.0,
1317 "domain" : [
1318 "https://www.wikiplanet.click",
1319 "https://cloudsfeed.com",
1320 "http://newsrule.com",
1321 "http://dehaut.com",
1322 "https://www.chinanbdb.com"
1323 ]
1324 }
1325
1326 /* 7 */
1327 {
1328 "_id" : "HK",
1329 "count" : 2.0,
1330 "domain" : [
1331 "https://www.desunpump.com",
1332 "http://www.10turntables.com"
1333 ]
1334 }
1335
1336 /* 8 */
1337 {
1338 "_id" : "UA",
1339 "count" : 2.0,
1340 "domain" : [
1341 "http://ukraine.admission.center",
1342 "http://umsa.admission.center"
1343 ]
1344 }
1345
1346 /* 9 */
1347 {
1348 "_id" : "GB",
1349 "count" : 2.0,
1350 "domain" : [
1351 "https://www.centrallanguageschool.com",
1352 "https://www.solasolv.com"
1353 ]
1354 }
1355
1356 /* 10 */
1357 {
1358 "_id" : "UNKNOWN",
1359 "count" : 2.0,
1360 "domain" : [
1361 "https://mi.buyaas.com",
1362 "http://en.wiki.wintoflash.com"
1363 ]
1364 }
1365
1366 /* 11 */
1367 {
1368 "_id" : "ES",
1369 "count" : 1.0,
1370 "domain" : [
1371 "https://www.torresbus.es"
1372 ]
1373 }
1374
1375 /* 12 */
1376 {
1377 "_id" : "IE",
1378 "count" : 1.0,
1379 "domain" : [
1380 "http://netkiosk.co.uk"
1381 ]
1382 }
1383
1384 /* 13 */
1385 {
1386 "_id" : "RU",
1387 "count" : 1.0,
1388 "domain" : [
1389 "http://www.treningmozga.com"
1390 ]
1391 }
1392
1393 /* 14 */
1394 {
1395 "_id" : "SG",
1396 "count" : 1.0,
1397 "domain" : [
1398 "https://omg-solutions.com"
1399 ]
1400 }
1401
1402 /* 15 */
1403 {
1404 "_id" : "JP",
1405 "count" : 1.0,
1406 "domain" : [
1407 "https://forexmania.org"
1408 ]
1409 }
1410
1411 /* 16 */
1412 {
1413 "_id" : "EU",
1414 "count" : 1.0,
1415 "domain" : [
1416 "http://www.the-good-stuff-factory.be"
1417 ]
1418 }
1419
1420 /* 17 */
1421 {
1422 "_id" : "TR",
1423 "count" : 1.0,
1424 "domain" : [
1425 "https://www.elitedeluxe.com.tr"
1426 ]
1427 }


First, I eyeballed and excluded all obvious product sites which are automatically translated.
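
When eyeballing long domain lists like those above, it can help to first pull out the entries whose URL itself carries the mi language code (mi.* or */mi). The sketch below is one way to test for that. The function name hasMiLangMarker is hypothetical, and this is not necessarily how urlContainsLangCodeInPath was computed for the Websites collection.

```javascript
// Hypothetical helper: true if the host starts with "mi." or if any path
// segment is exactly "mi" (the mi.* and */mi patterns).
function hasMiLangMarker(url) {
    var m = /^[a-z]+:\/\/([^\/?#]+)([^?#]*)/i.exec(url);
    if (!m) { return false; }                  // not an absolute URL
    var host = m[1].toLowerCase();
    var segments = m[2].toLowerCase().split("/");
    return host.indexOf("mi.") === 0 || segments.indexOf("mi") !== -1;
}

// A few URLs taken from the listings in these notes.
var sample = [
    "https://mi.wikipedia.org",
    "http://indigenousblogs.com/mi/",
    "http://www.goethe.de",
    "http://mi.hongwugas.com"
];
var withMarker = sample.filter(hasMiLangMarker);
// withMarker keeps the 1st, 2nd and 4th entries.
```

Such a check only sorts the list for inspection; whether a site is auto-translated still has to be judged by hand.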

The following remain of interest or possible interest, grouped by country of site origin:

US:
!! http://indigenousblogs.com [15/18 blogs work] - has one page in Maori (http://indigenousblogs.com/feeds/mi.xml)
X https://biblia.gospelprime.com.br - misdetection (containsMRI)
X ?https://follow3rs.com - seems dodgy and possibly auto-translated. Misspells "account" as "accout"
!! https://mi.m.wikipedia.org, https://mi.wikipedia.org
X https://usahello.org - autotranslated
X http://church-of-christ.org, http://www.church-of-christ.org - I think autotranslated, because https://church-of-christ.org/nl/ has "HET kerken van Christus" instead of the plural article DE
X https://www.livehoster.com
X http://www.americasportsfloor.com - product store. Misdetected
!! http://csunplugged.org, https://www.csunplugged.org - University of Canterbury NZ and site only available in EN, MI, DE, ES, CN
X https://mi.lawyers.cafe - autotranslated
    X https://mi.centr-zashity.ru - same as lawyers.cafe above: autotranslated
! https://policies.oclc.org - not completely translated. Copyright page, privacy statement and cookie statement pages appear to be in Maori. Not sure if autotranslated since other pages aren't available in MI. Dutch equivalent pages seem human translated.
X http://jobdescriptionsample.org - autotranslated
X http://mi.broadcastbeat.com - autotranslated product site
X http://www.samewe.net - autotranslated product site
X https://mi.kidspicturedictionary.com - autotranslated, but MAY BE USEFUL
X https://www.rikoooo.com - autotranslated

CN: -

FR:
? https://mi.phcoker.com - product site "Shangke Chemical Rapu + 86 (1812) 4514114 [email protected]"
X http://www.gpedia.com - dodgy copy of wikipedia, see http://www.gpedia.com/nl/gpedia/Hoofdpagina

NL:
X http://www.martinvrijland.nl - wordpress, autotranslated

CA:
X https://www.wikiplanet.click (seems like a dodgy copy of wikipedia)
X https://cloudsfeed.com - wordpress admin page

db.getCollection('Webpages').find({$and: [{isMRI: true}, {URL: /indigenousblogs\.com/}]})
=> http://indigenousblogs.com/mi/


TOTAL: Out of all non-NZ/non-AU sites that have "mi" in a webpage's URL path, only 4 sites contain genuine MRI sentences that aren't automatically translated.


TOTALS:
US: 25+4 from US with mi in URL path = 29
AU: 2
DE: 2
DK: 2
BG: 1
CZ: 1
ES: 1
FR: 1
IE: 1
TOTAL: 213+4 from US with mi in URL path = 217
------------------------------------------------
B. NEW ZEALAND SITES: NZ origin + .nz TLD SITES
------------------------------------------------
1. Get NZ sites where numPagesContainingMRI > 0

db.Websites.aggregate([
    {
        $match: {
            $and: [
                {numPagesContainingMRI: {$gt: 0}},
                {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]}
            ]
        }
    },
    { $unwind: "$geoLocationCountryCode" },
    {
        $group: {
            _id: "nz",
            count: { $sum: 1 },
            domain: { $addToSet: '$domain' },
            numPagesInMRICount: { $sum: '$numPagesInMRI' },
            numPagesContainingMRICount: { $sum: '$numPagesContainingMRI' }
        }
    },
    { $sort : { count : -1} }
]);

1512/* 1 */
1513{
1514 "_id" : "nz",
1515 "count" : 176.0,
1516 "domain" : [
1517!! "http://pukekohe.directorybusiness.co.nz", 0/2, 0/2, isMRI = 0!!
1518 "http://maori.livingheritage.org.nz", 2/2 2/2
1519 "http://pukoro.co.nz", 2/2 0/2
1520 "http://www.rakaumanga.school.nz", 0/4 0/4
1521 "http://www.ngamanawainc.co.nz", 0/2 0/2
1522 "https://office.e-agent.nz",
1523 "https://www.components-mart.nz",
1524 "http://tmmkkm.school.nz",
1525 "http://www.rotoruanz.com",
1526 "http://www.huri-translations.pf",
1527 "https://admin.teara.govt.nz",
1528 "http://hangaraumatihiko.tki.org.nz",
1529 "https://sexualviolence.victimsinfo.govt.nz",
1530 "http://www.tekura.school.nz",
1531 "http://philipbeadle.co.nz",
1532 "http://www.cs.waikato.ac.nz",
1533 "https://www.hapuhauora.health.nz",
1534 "http://cms.sunsmartschools.co.nz",
1535 "https://keepourmoneyclean.govt.nz",
1536 "http://www.kura-porirua.school.nz",
1537 "http://waitarahistory.org.nz",
1538 "http://oilcrash.com",
1539 "http://videos.e-agent.nz",
1540 "https://manawatuheritage.pncc.govt.nz",
1541 "https://www.terakipaewhenua.school.nz",
1542 "http://dev.nzpcn.org.nz",
1543 "https://kotahimiriona.co.nz",
1544 "http://kurakokiri.maori.nz",
1545 "https://www.sporty.co.nz",
1546 "http://kaupare.co.nz",
1547 "http://ngatiporoukiponeke.org.nz",
1548 "https://www.takitimu.ac.nz",
1549 "http://www.tetaurawhiri.govt.nz",
1550 "http://www.waiata.maori.nz",
1551 "http://conference.tpwt.maori.nz",
1552 "http://ngatiwhakaue.iwi.nz",
1553 "http://www.nzpcn.org.nz",
1554 "http://www.ruralfind.co.nz",
1555 "https://www.dnc.org.nz",
1556 "https://www.puau.school.nz",
1557 "https://kaiiwicamp.nz",
1558 "https://www.terito.school.nz",
1559 "https://www.pinterest.nz",
1560 "https://e-ako-pangarau.nzmaths.co.nz",
1561 "http://givealittle.co.nz",
1562 "https://teaomaori.news",
1563 "https://www.korokikahukura.co.nz",
1564 "http://myfathersworld.net.nz",
1565 "http://www.firstworldwar.tki.org.nz",
1566 "https://www.ashtangatauranga.co.nz",
1567 "http://biketorqueyamaha.co.nz",
1568 "https://www.rereahu.maori.nz",
1569 "http://www.tewikiotereomaori.co.nz",
1570 "http://www.brettgraham.co.nz",
1571 "http://tewikiotereomaori.nz",
1572 "http://anglicanprayerbook.nz",
1573 "http://arataua.nz",
1574 "http://blog.teara.govt.nz",
1575 "http://www.otepoti.school.nz",
1576 "http://www.kmk.maori.nz",
1577 "http://www.eventcinemas.co.nz",
1578 "https://www.stats.govt.nz",
1579 "http://www.oag.govt.nz", 2/2 0/2
1580 "http://whatonga.school.nz",
1581 "http://www.tewhanake.maori.nz",
1582 "https://www.maoritelevision.com",
1583 "http://kuraaiwi.maori.nz",
1584 "http://kurataiao.tki.org.nz",
1585 "http://teaohou.natlib.govt.nz",
1586 "http://www.tetaumuturunanga.iwi.nz",
1587 "http://www.tasteofplenty.co.nz",
1588 "http://community.nzdl.org",
1589 "https://www.blushandbrows.nz",
1590 "https://register.tpota.org.nz",
1591 "https://cdn.tehiku.nz",
1592 "http://www.wcl.govt.nz",
1593 "http://www.jeremybaker.nz",
1594 "http://punareo.co.nz",
1595 "https://rapuatearatika.education.govt.nz",
1596 "http://www.kurakokiri.maori.nz",
1597 "https://www.cruisetourstauranga.co.nz",
1598 "https://sooty.nz",
1599 "http://rakaumanga.school.nz",
1600 "https://tiritiowaitangi.govt.nz",
1601 "http://www.tmoa.tki.org.nz",
1602 "http://www.w3vietnam.org.nz",
1603 "https://www.infinite-electronic.nz",
1604 "https://www.komako.org.nz",
1605 "http://nzpostcard.co.nz",
1606 "http://artizani.co.nz",
1607 "http://www.finlaysonpark.school.nz",
1608 "http://crimson.co.nz",
1609 "http://holyspirit.nz",
1610 "http://www.tkkmmokopuna.school.nz",
1611 "http://www.pakanae.maori.nz",
1612 "http://www.teipukarea.maori.nz",
1613 "http://archerpix.com",
1614 "https://2019.nethui.nz",
1615 "http://www.kupengahao.co.nz",
1616 "https://www.lcds-display.nz",
1617 "http://waiata.maori.nz",
1618 "http://kuraproductions.co.nz",
1619 "http://www.biketorqueyamaha.co.nz",
1620 "http://www.livingheritage.org.nz",
1621 "http://www.zoomin.co.nz",
1622 "http://rsnz.natlib.govt.nz",
1623 "http://otorohanga.directorybusiness.co.nz",
1624 "http://reoora.co.nz",
1625 "http://w3vietnam.org.nz",
1626 "https://rehuamarae.co.nz",
1627 "https://www.electionresults.org.nz",
1628 "https://www.ngamanawainc.co.nz",
1629 "https://www.rotorua-rafting.co.nz",
1630 "https://www.taitokerautrust.org.nz",
1631 "https://www.wingspan.co.nz",
1632 "http://www.kkmmaungarongo.co.nz",
1633 "http://kete.wcl.govt.nz",
1634 "http://www.heartland.co.nz",
1635 "http://www.electionresults.govt.nz",
1636 "https://www.tematawai.maori.nz",
1637 "http://hana.co.nz",
1638 "http://www.tereowrap.nz",
1639 "http://rurued.school.nz",
1640 "http://www.twtop.school.nz",
1641 "http://rexedra.gen.nz",
1642 "http://archive.stats.govt.nz",
1643 "https://liveresults.co.nz",
1644 "https://www.e-agent.nz",
1645 "http://tiritiowaitangi.govt.nz",
1646 "http://www.hrc.co.nz",
1647 "http://animations.tewhanake.maori.nz",
1648 "https://interactives.stuff.co.nz",
1649 "http://avonside.net",
1650 "http://www.methodist.org.nz",
1651 "https://www.tasteofplenty.co.nz",
1652 "http://www.maoriinvestments.co.nz",
1653 "https://m.wairarapatv.co.nz",
1654 "http://www.gans.co.nz",
1655 "https://ttw1.cwp.govt.nz",
1656 "http://ngarauhuia.ngatiapakiterato.iwi.nz",
1657 "https://www.tuiatematangi.ac.nz",
1658 "http://tetaurawhiri.govt.nz",
1659 "http://maori.tki.org.nz",
1660 "http://www.topomap.co.nz",
1661 "https://www.puhaandpakeha.co.nz",
1662 "https://haereheikaiako.co.nz",
1663 "https://paekupu.co.nz",
1664 "https://curriculumtool.education.govt.nz",
1665 "http://firstworldwar.tki.org.nz",
1666 "http://www.28maoribattalion.org.nz",
1667 "https://hepatakakupu.nz",
1668 "https://www.zenbu.co.nz",
1669 "http://www.matarikifestival.org.nz",
1670 "http://pukapuka.nz",
1671 "http://ngatipahauwera.co.nz", 2/2 2/2
1672 "http://southerntribes.co.nz",
1673 "https://player.vimeo.com",
1674 "http://tmoa.tki.org.nz",
1675 "http://www.writersfestival.co.nz",
1676 "http://talkingtothecan.com",
1677 "https://www.whanau-tahi.school.nz",
1678 "http://satellites.co.nz",
1679 "http://auturoa.nz",
1680 "http://www.tuwharetoa.iwi.nz",
1681 "http://kmpmusic.co.nz",
1682 "http://www.temarareo.org",
1683 "http://archive.electionresults.govt.nz",
1684 "http://kaiiwicamp.nz",
1685 "http://tehauora.org.nz",
1686 "http://temahurehure.maori.nz",
1687 "http://www.runanga.co.nz"
1688 ],
1689 "numPagesInMRICount" : 4360,
1690 "numPagesContainingMRICount" : 9641
1691}
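
The domain list above contains the same site under several variants (http/https, with and without www), for example rakaumanga.school.nz, tetaurawhiri.govt.nz and kaiiwicamp.nz. A small sketch for collapsing such variants before manual inspection follows. The function names siteKey and groupVariants are hypothetical, and treating www and non-www hosts as the same site is an assumption that should be checked per site.

```javascript
// Hypothetical helper: reduce a domain URL to a key that ignores the scheme
// and a leading "www." (assumption: these variants are the same site).
function siteKey(domainUrl) {
    return domainUrl.toLowerCase()
        .replace(/^[a-z]+:\/\//, "")   // drop http:// or https://
        .replace(/^www\./, "")         // drop a leading www.
        .replace(/\/.*$/, "");         // drop any path
}

// Group domain URLs by site key, keeping the first-seen order of the keys.
function groupVariants(domains) {
    var groups = {};
    var order = [];
    domains.forEach(function (d) {
        var key = siteKey(d);
        if (!Object.prototype.hasOwnProperty.call(groups, key)) {
            groups[key] = [];
            order.push(key);
        }
        groups[key].push(d);
    });
    return order.map(function (key) {
        return {site: key, variants: groups[key]};
    });
}

// Entries taken from the NZ domain list above.
var grouped = groupVariants([
    "http://www.rakaumanga.school.nz",
    "https://kaiiwicamp.nz",
    "http://rakaumanga.school.nz",
    "http://kaiiwicamp.nz",
    "https://paekupu.co.nz"
]);
// grouped has 3 entries; the first is rakaumanga.school.nz with 2 variants.
```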


NZ sites with pages detected as being overall inMRI are more likely to genuinely contain at least one sentence inMRI.
Therefore, to make the manual task of going through all NZ sites a bit easier,
will work with 2 query results that combine into the above:
- those NZ sites where numPagesInMRI > 0
- and the remaining NZ sites that only contain MRI (numPagesInMRI = 0 but numPagesContainingMRI > 0)
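
The two groups described above are meant to be complementary. A minimal sketch of the two match conditions follows, together with a plain JavaScript mirror of the count conditions that can be tried on sample documents. The variable and function names are hypothetical, and the sketch assumes that numPagesInMRI is stored as a number on every Websites document.

```javascript
// NZ-related condition, as used in the queries in these notes.
var nzSite = {$or: [{geoLocationCountryCode: "NZ"}, {domain: /\.nz/}]};

// Group 1: sites with at least one page that is overall in MRI.
var matchInMRI = {$and: [{numPagesInMRI: {$gt: 0}}, nzSite]};

// Group 2: the remainder, sites whose pages only contain MRI sentences.
var matchOnlyContainsMRI = {$and: [
    {numPagesInMRI: 0},
    {numPagesContainingMRI: {$gt: 0}},
    nzSite
]};

// Plain-JS mirror of the count conditions (the NZ condition is left out),
// returning which group a site document falls into.
function groupOf(site) {
    if (site.numPagesInMRI > 0) { return 1; }
    if (site.numPagesInMRI === 0 && site.numPagesContainingMRI > 0) { return 2; }
    return 0;   // not detected as containing MRI at all
}

// In the mongo shell:
//   db.Websites.find(matchInMRI).count()
//   db.Websites.find(matchOnlyContainsMRI).count()
```

If the split is clean, the two shell counts should add up to the count of the combined query.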

----------------------------

2. Get NZ sites where numPagesInMRI > 0

db.Websites.aggregate([
    {
        $match: {
            $and: [
                {numPagesInMRI: {$gt: 0}},
                {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]}
            ]
        }
    },
    { $unwind: "$geoLocationCountryCode" },
    {
        $group: {
            _id: "nz",
            count: { $sum: 1 },
            domain: { $addToSet: '$domain' },
            numPagesInMRICount: { $sum: '$numPagesInMRI' },
            numPagesContainingMRICount: { $sum: '$numPagesContainingMRI' }
        }
    },
    { $sort : { count : -1} }
]);


Annotating the matching domain listing as follows:
* First column: n pages that are in MRI / n sampled isMRI pages
    To check that a site has any pages detected as being in MRI:
    db.getCollection('Webpages').find({URL:/teipukarea\.maori\.nz/, isMRI: true})
* Second column: n pages that do contain MRI / n sampled pages that are not isMRI yet containsMRI
    To find those pages that are containsMRI but not isMRI, and check whether they do indeed have sentences in MRI:
    db.getCollection('Webpages').find({URL:/maori\.livingheritage\.org\.nz/, isMRI: false, containsMRI: true})
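
The annotations rely on sampling a few pages per site. One way to draw such a sample in the shell is the $sample aggregation stage; the sketch below builds the pipeline. The helper names escapeRegex and samplePagesPipeline are hypothetical, and these notes do not say how the sampled pages were actually picked. The host is regex-escaped so that its dots match literally.

```javascript
// Escape regex metacharacters so a host name can be embedded in a RegExp.
function escapeRegex(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: pipeline drawing up to `size` random pages of one site
// from the Webpages collection, either isMRI pages or pages that are not
// isMRI but are still flagged containsMRI.
function samplePagesPipeline(host, wantIsMRI, size) {
    var match = {URL: new RegExp(escapeRegex(host)), isMRI: wantIsMRI};
    if (!wantIsMRI) {
        match.containsMRI = true;
    }
    return [{$match: match}, {$sample: {size: size}}];
}

// In the mongo shell:
//   db.getCollection('Webpages').aggregate(samplePagesPipeline("teipukarea.maori.nz", true, 3))
//   db.getCollection('Webpages').aggregate(samplePagesPipeline("maori.livingheritage.org.nz", false, 3))
```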
1734
1735
1736/* 1 */
1737{
1738 "_id" : "nz",
1739 "count" : 96.0,
1740 "domain" : [
1741 "http://www.teipukarea.maori.nz", 3/3 1/3
1742 "http://ngatipahauwera.co.nz", 2/2, 2/2
1743 "http://www.oag.govt.nz", 2/2 0/2
1744 "https://sexualviolence.victimsinfo.govt.nz", 3/3 0/3
1745 "http://tmoa.tki.org.nz", 3/3 3/3
1746 "http://www.tewhanake.maori.nz", 3/3 2/3
1747 "http://www.matarikifestival.org.nz", 4/4 0/3
1748 "http://www.otepoti.school.nz", 3/3 0/4
1749!! "https://www.maoritelevision.com", 3/4, 0 [no containsMRI outside isMRI pages]
1750 "http://pukapuka.nz", 3/3 1/4 [lorem ipsum used on first 3 pages]
1751 "http://community.nzdl.org", 3/3 0/3 [containsMRI has detected Te Taka Keegan as MRI sentence]
1752X!! "http://kmpmusic.co.nz", 0-4/4? [but CD listing of some MRI album and song titles] 0 [no other pages containsMRI]
1753 "http://maori.livingheritage.org.nz", 2/2 2/2 {includes: http://www.livingheritage.org.nz}
1754 "http://pukoro.co.nz", 2/2 0/2
1755X "https://register.tpota.org.nz", 0/1 [form] 0/2
1756+ "https://cdn.tehiku.nz" => DOMAIN: "tehiku.nz", 0/4, 1/3 [but audio content may be in MRI] But there are pages containing MRI to be found by non-random sampling of tehiku.nz, e.g. https://tehiku.nz/te-hiku-radio/te-tangihanga-ki-a-erima-henare/ contains MRI sentences
1757!! "http://www.runanga.co.nz", 3/3 0 [no containsMRI outside isMRI pages]
1758! "http://kuraaiwi.maori.nz", 2/4 [navigation only downloaded. But site content checked] 2/3
1759 "http://kurataiao.tki.org.nz", 3/3, 1/total 3
1760
1761!! "http://satellites.co.nz", 3/3 [kpop], 0 [no containsMRI outside isMRI pages]
1762 "http://teaohou.natlib.govt.nz", 4/4, 2/4
1763 "http://www.tuwharetoa.iwi.nz", 2/3 0/3
1764X "http://auturoa.nz", 0/4 0/3 [lots of MRI terms among English] - COMMUNITY (But there are pages inMRI to be found by non-random sampling, e.g. http://auturoa.nz/KarakiaMoKuaToRangiTeRaa.html)
1765 "https://www.terito.school.nz", 3/3, 0/2 total
1766 "https://ttw1.cwp.govt.nz", 3/3 3/3
1767 "https://www.whanau-tahi.school.nz", 4/4, 1/2 total
1768 "https://e-ako-pangarau.nzmaths.co.nz", 3/3 total, 1/1 total
1769 "https://teaomaori.news", 3/3, 0/1 total
    "http://tetaurawhiri.govt.nz", 3/3 3/3 [Māori Language Commission site]
1771 "https://www.tuiatematangi.ac.nz", 4/4 3/3
1772 "http://animations.tewhanake.maori.nz", 3/3 3/3
1773!! "https://www.dnc.org.nz", 1/1 total, 0 [no containsMRI outside isMRI pages]
1774!! "http://firstworldwar.tki.org.nz", 3/3, 0 [no containsMRI outside isMRI pages]
1775 "http://www.28maoribattalion.org.nz", 3/3, 1/3
1776 "http://www.tewikiotereomaori.co.nz", 1/1 total, 3/3
1777 "http://www.brettgraham.co.nz", 1/1 total, 0/3
1778!! "https://hepatakakupu.nz", 3/3, 0 [no containsMRI outside isMRI pages]
1779
1780 "http://anglicanprayerbook.nz", 3/3 3/3
1781 "http://arataua.nz", 4/4, 2/3
1782 "http://maori.tki.org.nz", 3/3 3/3
1783DONE (with/out www): "http://www.firstworldwar.tki.org.nz",
1784X "http://www.topomap.co.nz", 0/2 [all placenames], 0 [no containsMRI outside isMRI pages]
1785 "https://paekupu.co.nz", 4/4, 0 [no containsMRI outside isMRI pages]
1786 "https://haereheikaiako.co.nz", 1/1, 0 [no containsMRI outside isMRI pages]
1787 "https://curriculumtool.education.govt.nz", 4/4, 3/3
1788 "http://kurakokiri.maori.nz", 3/3, 3/3 [same nav menus on each page] {includes: "http://www.kurakokiri.maori.nz"}
1789 "http://www.kkmmaungarongo.co.nz", 3/3, 3/3
1790 "http://www.heartland.co.nz", 3/3, 1/1 total
1791 "http://oilcrash.com", 2/2 total, 0/3
1792 "http://www.kura-porirua.school.nz", 4/4, 2/3
1793 "https://www.sporty.co.nz", 3/3, 0 [no containsMRI outside isMRI pages]
1794 "https://www.tematawai.maori.nz", 3/3, 3/3
1795
1796+ "https://www.terakipaewhenua.school.nz",
1797+ "http://www.tetaurawhiri.govt.nz",
1798+ "http://archive.stats.govt.nz", (1 page isMRI)
1799+ "http://tiritiowaitangi.govt.nz",
1800+!! "http://www.waiata.maori.nz", {includes: "http://waiata.maori.nz"}
1801+ "http://hana.co.nz", [crawled version of page contains MRI sentences, but current page is labelled for mobiles whereas in browser it just has a picture]
1802+ "http://kaupare.co.nz",
1803+ "http://www.tereowrap.nz",
1804?X "https://www.e-agent.nz", [autotranslated? SEO related site in NZ, Chinese, English and MRI] {includes: "https://office.e-agent.nz"}
1805 { included: "http://videos.e-agent.nz", [AT: e-agent.nz] 3/3, 3/3 [repeated nav] }
1806+ "http://www.hrc.co.nz",
1807+ "http://ngatiporoukiponeke.org.nz",
1808
1809+ "http://rurued.school.nz",
1810+ "http://www.twtop.school.nz",
1811X "https://www.infinite-electronic.nz", [autotranslated product site]
1812+!! "http://www.huri-translations.pf",
1813+ "https://admin.teara.govt.nz", e.g. https://admin.teara.govt.nz/mi/biographies/4m56/moko-pita-te-turuki-tamati {included: "http://blog.teara.govt.nz", 3/3, 0/3 [AS: teara.govt.nz, e.g. https://teara.govt.nz/mi/biographies/1t28/te-hapuku/media]}
1814+!! "https://tiritiowaitangi.govt.nz",
1815+ "http://www.tmoa.tki.org.nz",
1816+ "https://www.komako.org.nz", [no longer available, but crawled page had a paragraph in Maori with presumably a translation thereafter]
1817+ "http://www.wcl.govt.nz", {included: "http://kete.wcl.govt.nz" as wcl.govt.nz; 2/5 [first 3 misdetected: Tokelauan (American Samoa), Kiribati, Tongan], 0/3}
1818+!! "http://punareo.co.nz", [waiata]
1819
1820+ "https://rapuatearatika.education.govt.nz",
1821+ "http://tmmkkm.school.nz",
1822X "https://www.components-mart.nz", [autotranslated product site]
1823+ "http://www.cs.waikato.ac.nz", [Te Taka's pages!]
1824+!!! "http://www.kupengahao.co.nz", [MRI language books and resources]
1825+ "https://www.hapuhauora.health.nz", [Smokefree site. The one isMRI page has a proper sentence.]
1826X "https://www.lcds-display.nz", [autotranslated product site]
1827+ "http://cms.sunsmartschools.co.nz", [as sunsmartschools.co.nz, e.g. http://sunsmartschools.co.nz/kura/how-to-become.html]
1828+ "http://kuraproductions.co.nz",
1829+ "https://keepourmoneyclean.govt.nz", [1 page]
1830
1831+!! "http://www.tekura.school.nz",
1832+ "http://www.tkkmmokopuna.school.nz", [e.g. http://www.tkkmmokopuna.school.nz/newsletter_sets/newsletters/13-wahanga-4-wiki-5-he-puna-korero]
1833+ "http://hangaraumatihiko.tki.org.nz", [govt website. e.g. http://hangaraumatihiko.tki.org.nz/te-tupuranga-tangata-me-te-rorohiko/whakatupuranga-3-te-whakatau-tikanga-kawe-i-nga-mahi-whakaoti-hopanga/]
1834+ "http://www.pakanae.maori.nz"
1835 ],
1836 "numPagesInMRICount" : 4360,
1837 "numPagesContainingMRICount" : 7968
1838}
1839
1840
96 sites detected as having isMRI pages - 7 sites/subdomains already included among the 96 = 89 unique sites.

-2.5* product sites -2 non-MRI sites with song listings or web forms etc.
   *0.5 for the e-agent.nz site
= 84.5 sites in total that at least contain MRI; most have pages inMRI.

We are excluding the site marked with ?X as it appears to be autotranslated.
In this set, then, there are 84 sites that at least contain MRI out of 89 unique sites detected as containing pages inMRI.

If not counting unique sites but counting the MongoDB query result's subdomains separately: 84 + 4 sites (non-unique or split over subdomains) in the result set contained MRI = 88 sites.
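The tally above can be written out step by step. This is only a sketch of the arithmetic: the figures are the ones stated in the notes, and the variable names are illustrative.

```javascript
// Figures are taken from the notes above; the variable names are illustrative.
const detected = 96;         // sites returned by the query
const duplicates = 7;        // sites/subdomains already counted under another entry
const uniqueSites = detected - duplicates;

const productSites = 2.5;    // autotranslated product sites (e-agent.nz counted as 0.5)
const nonMriSites = 2;       // song listings, web forms etc.
const containingMri = uniqueSites - productSites - nonMriSites;

const eAgentRemainder = 0.5; // dropping the ?X site (e-agent.nz) entirely
const confirmedUnique = containingMri - eAgentRemainder;

const extraSubdomains = 4;   // non-unique sites or sites split over subdomains
const confirmedWithSubdomains = confirmedUnique + extraSubdomains;

console.log(uniqueSites, containingMri, confirmedUnique, confirmedWithSubdomains);
// 89 84.5 84 88
```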
1851
1852----------------------------
1853
3. Handling the remainder: NZ sites where numPagesInMRI = 0 BUT numPagesContainingMRI > 0

The remainder = 80 NZ sites detected as having no pages inMRI, but with a positive number of pages detected as containsMRI:
1857
db.Websites.aggregate([
    {
        // Sites with pages that contain MRI but no pages that are in MRI,
        // restricted to NZ-related sites (NZ geolocation or a .nz domain)
        $match: {
            $and: [
                {numPagesContainingMRI: {$gt: 0}},
                {numPagesInMRI: {$eq: 0}},
                {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]}
            ]
        }
    },
    { $unwind: "$geoLocationCountryCode" },
    {
        // Collect all matching sites into a single "nz" group
        $group: {
            _id: "nz",
            count: { $sum: 1 },
            domain: { $addToSet: '$domain' },
            numPagesInMRICount: { $sum: '$numPagesInMRI' },
            numPagesContainingMRICount: { $sum: '$numPagesContainingMRI' }
        }
    },
    { $sort : { count : -1} }
]);
1880
1881
Find pages for testing with:
    db.getCollection('Webpages').find({URL:/ashtangatauranga\.co\.nz/, containsMRI: true, mriSentenceCount: {$gt: 0}})
1884
1885
1886/* 1 */
1887{
1888 "_id" : "nz",
1889 "count" : 80.0,
1890 "domain" : [
1891X "http://www.zoomin.co.nz", [map site, so placenames]
1892X "http://www.biketorqueyamaha.co.nz", [placenames] {includes "http://biketorqueyamaha.co.nz"}
1893X "http://archerpix.com", [photo captions containing placenames]
1894X "http://philipbeadle.co.nz", [art captions containing placenames]
1895X "https://2019.nethui.nz", [Just MRI words in ENG sentences]
1896X "http://crimson.co.nz", [address]
1897+ "http://holyspirit.nz", (e.g. https://holyspirit.nz/wp-content/uploads/02-25-2018-Newsletter.pdf)
1898X "https://www.wingspan.co.nz", [1 page, 1 conjoined word, looks like placename]
1899X "http://nzpostcard.co.nz", [postcards with placenames]
1900+ "https://www.ngamanawainc.co.nz", (e.g. https://www.ngamanawainc.co.nz/history/id/60) {includes "http://www.ngamanawainc.co.nz"}
1901
1902+ "http://www.finlaysonpark.school.nz", [e.g. http://www.finlaysonpark.school.nz/58/pages/61-team-overview; Some Tongan language pages and Samoan pages]
1903X "http://artizani.co.nz", [address]
1904+ "http://www.w3vietnam.org.nz", [e.g. http://www.w3vietnam.org.nz/w3%20mihi%20ki%20ka%20hoia.htm] (includes "http://w3vietnam.org.nz")
1905X "https://sooty.nz", [names, war death notices, place names]
1906X? "http://rakaumanga.school.nz", [School has a longer Maori title name, but that's it] {includes "http://www.rakaumanga.school.nz"}
1907X "http://www.rotoruanz.com", [e.g. https://www.rotoruanz.com/RNZ/media/Media-Library/SOI-Board-version-08-June-2018-final-published-on-website.pdf]
1908X "https://www.cruisetourstauranga.co.nz", [English with Tauranga being MRI placename]
1909X "http://www.jeremybaker.nz", [one word, HOkio]
1910
1911X "https://liveresults.co.nz", [canoe sports team names]
1912X "http://rexedra.gen.nz", [ENG sentence with MRI words]
1913+ "https://www.takitimu.ac.nz", [school of performing arts, example MRI sentence at https://www.takitimu.ac.nz/about-us]
1914X "http://www.electionresults.govt.nz", [placenames and misdetection of Return to Home Page] {includes https://www.electionresults.org.nz which seems to be an alias for the same} {includes "http://archive.electionresults.govt.nz"}
+ "https://kotahimiriona.co.nz", (e.g. https://kotahimiriona.co.nz/about-the-app/)
1916+ "https://rehuamarae.co.nz", (e.g. https://rehuamarae.co.nz/)
1917+ "http://reoora.co.nz", (e.g. https://reoora.co.nz/about/)
1918
1919X "http://otorohanga.directorybusiness.co.nz", [placenames]
1920X "http://waitarahistory.org.nz", [placenames and misdetection of sentences of the form "31, No 262." as MRI]
1921+ "https://manawatuheritage.pncc.govt.nz", (e.g. https://manawatuheritage.pncc.govt.nz/about)
1922+ "http://rsnz.natlib.govt.nz", [e.g. http://rsnz.natlib.govt.nz/volume/rsnz_26/rsnz_26_00_004560.html] NOTE: site appears to use Greenstone
1923X "https://www.rotorua-rafting.co.nz", [placenames]
1924+ "https://www.taitokerautrust.org.nz", (e.g. https://www.taitokerautrust.org.nz/)
1925+ "http://tewikiotereomaori.nz", (e.g. https://tewikiotereomaori.nz/about/)
1926+ "https://www.korokikahukura.co.nz", (e.g. https://www.korokikahukura.co.nz/ko-wai-m257tou.html about the connection to the Waikato River)
1927
1928X "https://www.puhaandpakeha.co.nz", [ENG sentences with MRI words]
1929X "http://myfathersworld.net.nz", [placenames]
1930X "https://www.ashtangatauranga.co.nz", [misdetection]
1931+ "https://www.pinterest.nz", (e.g. sentence "E tu Whare Rangi, e tu Whare Pataka, kia ora koutou nga Whare Tipuna o te Ao Maori." at https://www.pinterest.nz/amp/marylizw03/maori/)
1932+ "https://www.rereahu.maori.nz", (e.g. https://www.rereahu.maori.nz/uploads/72717/files/168860/Te_Maru_o_Rereahu_Newsletter_No_5_February_2011.pdf)
+ "http://givealittle.co.nz", (e.g. https://givealittle.co.nz/cause/rebuild-tapu-te-ranga-marae which contains greeting phrases and the sentence "Nā te ringa tangata i hanga te whare Nā te tuarā o te whare i whakatipu i te tangata")
1934X "http://www.gans.co.nz", [placenames]
1935+ "https://kaiiwicamp.nz", [placenames] {includes "http://kaiiwicamp.nz"}
1936+ "http://ngarauhuia.ngatiapakiterato.iwi.nz", (e.g. http://ngarauhuia.ngatiapakiterato.iwi.nz/pdf/Nga-Rau-Huia_Raumati_2015_Issue1.pdf)
1937+ "https://m.wairarapatv.co.nz", (e.g. https://m.wairarapatv.co.nz/archive/i/JGmJqCZCNK8/te-ara-whanui-kura-kaupapa-maori-o-nga-khanga-reo-o-te-awa-kairangi)
1938
1939X "http://www.methodist.org.nz", [ENG sentence with MRI words]
1940+ "http://avonside.net", (e.g. http://avonside.net/Nicky/indexmain.htm)
1941X "http://www.ruralfind.co.nz", [placenames]
1942+ "http://www.maoriinvestments.co.nz", (e.g. Vision, Mission & Values tab of https://maoriinvestments.co.nz/about/organisation)
1943+ "http://conference.tpwt.maori.nz", (e.g. https://www.tpwt.maori.nz/te-mahe-matauranga/)
1944+ "https://www.puau.school.nz", (e.g. https://www.puau.school.nz/k%C4%81inga-home)
1945+? "http://ngatiwhakaue.iwi.nz", (e.g. greetings at http://ngatiwhakaue.iwi.nz/)
1946X "http://www.nzpcn.org.nz", [lots of misdetections and one sentence in English with MRI words, e.g. "Kaori is a Taonga to Maori".] {includes "http://dev.nzpcn.org.nz"}
1947+? "https://interactives.stuff.co.nz", [2 x page mostly empty except for the title "TE WIKI O TE REO MĀORI Māori"]
1948+ "http://tehauora.org.nz", (e.g. http://tehauora.org.nz/ and proverb at bottom of http://tehauora.org.nz/about-us)
1949
1950+ "http://temahurehure.maori.nz", (e.g. greeting http://temahurehure.maori.nz/site/file/temp/hirebook.pdf)
1951X "http://pukekohe.directorybusiness.co.nz", [placenames]
1952+!! "http://www.temarareo.org", (dictionary with sample sentences. e.g. http://www.temarareo.org/PAPAKUPU/dictionary-searchresults/A-2.htm)
X "https://www.tasteofplenty.co.nz", [Tauranga placename and misdetection of "No one went away hungry"] {includes "http://www.tasteofplenty.co.nz"}
1954
1955+ "http://www.tetaumuturunanga.iwi.nz", (e.g. http://www.tetaumuturunanga.iwi.nz/wp-content/uploads/2016/04/Malvern-Cultural-Narrative-Draft.pdf)
1956
1957
1958X "https://www.blushandbrows.nz", [misdetection of "Makeup..."]
X "http://talkingtothecan.com", [misdetection of things like "22, no." and mistaking ENG sentences with MRI words for MRI]
1960
1961+? "http://whatonga.school.nz", [school title]
1962+? "https://player.vimeo.com", [Video titles contain MRI sentences and MRI titles interspersed in ENG sentence. Video's audio content may be in MRI]
1963+ "http://www.writersfestival.co.nz", (e.g. http://www.writersfestival.co.nz/news/Page1/urgently-relevant-novel-wins-countrys-richest-literary-award/)
1964+? "http://southerntribes.co.nz", [brazilian jiu jitsu site with a greeting in MRI on main page]
1965+ "http://www.kmk.maori.nz", (e.g. http://www.kmk.maori.nz/kmk-events)
1966+ "https://www.stats.govt.nz", (e.g. http://archive.stats.govt.nz/Census/2001-census-data/2001-census-pacific-profiles/cook-island-maori-people-in-new-zealand.aspx)
1967X "http://www.eventcinemas.co.nz", [placenames and misdetection of ENG "Atura Hotels"]
1968X "https://www.zenbu.co.nz" [misdetection and NZ school addresses]
1969 ],
1970 "numPagesInMRICount" : 0,
1971 "numPagesContainingMRICount" : 1673
1972}
1973
80 sites detected as having 0 pages inMRI but >0 pages that containMRI.

[Of these, 9 are part of the same site/subdomain as another entry => 71 unique sites.
Of the remaining ones, only 35 have at least one sentence in Maori and are marked with +. (Those marked with +? just have Maori titles or greetings, or nothing more than a single sentence.)
So in this set, there are a further 35 sites that contain MRI out of 71 unique sites detected as having pages containingMRI but no pages inMRI.
Total sites: 35/71
Total for NZ: (84+35)/(89+71) = 119/160 unique NZ sites have at least one webpage containing at least one sentence inMRI.
]
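Collapsing entries that differ only in protocol or a leading "www." can be automated before the manual pass. A minimal sketch, assuming hypothetical helpers `siteKey` and `countUniqueSites` (not part of the original notes). It does not cover the subdomain groupings noted with {includes ...} in the listing, such as archive.* hosts, which were judged by hand.

```javascript
// Hypothetical helper: reduce a URL to a comparable site key by dropping the
// protocol and a leading "www." from the host name.
function siteKey(url) {
  return new URL(url).hostname.replace(/^www\./, '');
}

// Count distinct sites in a list of domain URLs.
function countUniqueSites(urls) {
  return new Set(urls.map(siteKey)).size;
}

// Sample entries taken from the listing above.
const sample = [
  'http://www.biketorqueyamaha.co.nz',
  'http://biketorqueyamaha.co.nz',
  'https://kaiiwicamp.nz',
  'http://kaiiwicamp.nz',
  'http://crimson.co.nz',
];
console.log(countUniqueSites(sample)); // 3
```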
1982
TOTAL:
If counting subdomains and duplicated sites distinctly, then 35 + an additional 3 sites, making it 38/80 sites in this set.

This makes (88+38)/(96+80) = 126/176 NZ sites (counting distinct subdomains and duplicated sites) that contain at least one web page with at least 1 sentence in MRI.
1987
1988
1989
1990
3. GRAND TOTALS

Count per country of web SITES that contain at least 1 web page containing at least 1 genuine MRI sentence.
(The number in brackets for an overseas country is the number of sites of that geolocation if nz TLDs were NOT grouped with NZ geolocation under "NZ". The number in brackets for NZ is the number of sites that are of NZ geolocation only, ignoring nz TLDs hosted overseas.)

OLD
countryCode: num sites manually confirmed as having pages containing MRI, out of num sites openNLP detected as having pages containing MRI
NZ: 126 actual sites out of 176 (89) detected sites
US: 29 actual out of 422 (486) detected sites
AU: 2 actual out of 5 (21) detected sites
DE, Germany: 2 actual out of 27 detected sites
DK, Denmark: 2 out of 8
BG, Bulgaria: 1 out of 1
CZ, Czech Republic: 1 out of 4
ES, Spain: 1 out of 5 (7)
FR, France: 1 out of 35 (36)
IE, Ireland: 1 out of 2


TOTAL: 166 sites, out of all the crawled sites, where the crawled set of pages per site actually contained at least one sentence in Māori based on manual inspection.
This is out of a total of 221+471+176 = 868 sites that were detected with numPagesContainingMRI > 0 (868 sites containing at least one page with at least one sentence detected as MRI).
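As a cross-check of the grand total, the per-country confirmed counts can be summed and compared with the number of detected sites. A small sketch using only figures from the listing above; the variable names are illustrative.

```javascript
// Manually confirmed site counts per country, copied from the listing above.
const confirmedPerCountry = {
  NZ: 126, US: 29, AU: 2, DE: 2, DK: 2,
  BG: 1, CZ: 1, ES: 1, FR: 1, IE: 1,
};

const confirmedTotal = Object.values(confirmedPerCountry)
  .reduce((sum, n) => sum + n, 0);

// Detected sites, using the three set sizes stated in the notes.
const detectedTotal = 221 + 471 + 176;

console.log(confirmedTotal + ' of ' + detectedTotal); // 166 of 868
console.log((100 * confirmedTotal / detectedTotal).toFixed(1) + '%');
```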
2011
========================================