source: other-projects/maori-lang-detection/mongodb-data/ManualShortlisting2_afterMongoDBReingest.txt@ 33936

Last change on this file since 33936 was 33936, checked in by ak19, 4 years ago

Renaming old file to place with new counts after reingesting into MongoDB.

Want to MANUALLY go over all sites detected as containing one or more pages with at least one MRI sentence,
and shortlist those sites that genuinely contain at least one MRI sentence.


Total num sites detected as containing MRI:
    db.getCollection('Websites').find({numPagesContainingMRI: {$gt: 0}}).count()
= 869


To make the manual task easier,
splitting the results of all sites with numPagesContainingMRI > 0 into NZ sites and overseas sites,
since NZ sites are more likely to contain MRI content.
13
-------------------------------------------------------------------
A. OVERSEAS SITES: sites neither NZ in origin nor under the .nz TLD
-------------------------------------------------------------------
Further splitting the overseas sites into those with an mi in the URL path (mi.* or */mi) and those without,
since overseas sites with mi in the URL path are more likely to be automatically translated product sites.
19
1. db.getCollection('Websites').find(
{$and: [
    {numPagesContainingMRI: {$gt: 0}},
    {geoLocationCountryCode: {$ne: "NZ"}},
    {domain: {$not: /.nz$/}},
    {urlContainsLangCodeInPath: {$ne: true}}
]}).count()

= 221 websites
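
The four conditions of query 1 can be sketched as a plain JavaScript predicate over in-memory records. This is only an illustration: the sample records and the helper name isOverseasWithoutLangCode are hypothetical, and only the field names come from the query above. It also shows one thing worth checking by hand: the unescaped dot in /.nz$/ matches any character, so a domain that merely ends in "nz" would be excluded as well, whereas /\.nz$/ only excludes a literal .nz ending.

```javascript
// Hypothetical sample records; field names follow the Websites collection queried above.
const sampleSites = [
  { domain: "http://example.com",    geoLocationCountryCode: "US", numPagesContainingMRI: 3, urlContainsLangCodeInPath: false },
  { domain: "http://mi.example.com", geoLocationCountryCode: "US", numPagesContainingMRI: 5, urlContainsLangCodeInPath: true },
  { domain: "http://example.co.nz",  geoLocationCountryCode: "US", numPagesContainingMRI: 2, urlContainsLangCodeInPath: false },
  { domain: "http://example.org",    geoLocationCountryCode: "NZ", numPagesContainingMRI: 1, urlContainsLangCodeInPath: false },
  { domain: "http://example.frenz",  geoLocationCountryCode: "DE", numPagesContainingMRI: 1, urlContainsLangCodeInPath: false },
];

// Mirrors the four $and conditions of query 1; nzPattern is the regex under test.
function isOverseasWithoutLangCode(site, nzPattern) {
  return site.numPagesContainingMRI > 0 &&
         site.geoLocationCountryCode !== "NZ" &&
         !nzPattern.test(site.domain) &&
         site.urlContainsLangCodeInPath !== true;
}

// Pattern as written in the query: "." matches any character.
const asWritten = sampleSites
  .filter(s => isOverseasWithoutLangCode(s, /.nz$/))
  .map(s => s.domain);

// Pattern with the dot escaped: only a literal ".nz" ending is excluded.
const escaped = sampleSites
  .filter(s => isOverseasWithoutLangCode(s, /\.nz$/))
  .map(s => s.domain);

console.log(asWritten); // the made-up ".frenz" domain is dropped here
console.log(escaped);   // and kept here
```

Whether the real data contains any domain affected by this difference is not known from these notes; the counts recorded here are those of the queries as actually run.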
29
[Treating Australia as a special case, since one of the 4 Australian sites with numPagesContainingMRI > 0
had an mi in the URL path but was not automatically translated.]

# count excluding NZ related sites, keeping Australian sites even if they have mi in the URL path

db.getCollection('Websites').find({$and: [
    {geoLocationCountryCode: {$ne: "NZ"}},
    {domain: {$not: /\.nz/}},
    {numPagesContainingMRI: {$gt: 0}},
    {$or: [{geoLocationCountryCode: "AU"}, {urlContainsLangCodeInPath: false}]}
]}).count()

= 221 websites
43
44
Both counts are the same. This means that, after reingesting into MongoDB, there are no longer any Australian sites with mi in the URL path. Previously, manual inspection had found kiwiproperty.com, then geolocated to Australia, to be a genuine site of interest. Since its geolocation changed to US upon reingest, we no longer have to treat that site, and therefore Australian sites with mi in their URL paths, specially.
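
The reasoning behind comparing the two counts can be sketched in plain JavaScript: the second query admits every site the first one does, plus any Australian site with mi in its URL path, so the difference between the counts is the number of such Australian sites. The records and the function name below are hypothetical, and the sketch assumes urlContainsLangCodeInPath is always a boolean (otherwise {$ne: true} and false would not select the same sites).

```javascript
// Hypothetical records, all assumed to already pass the non-NZ and
// numPagesContainingMRI > 0 conditions shared by both queries.
const overseasSample = [
  { domain: "http://a.example", geoLocationCountryCode: "US", urlContainsLangCodeInPath: false },
  { domain: "http://b.example", geoLocationCountryCode: "AU", urlContainsLangCodeInPath: false },
  { domain: "http://c.example", geoLocationCountryCode: "DE", urlContainsLangCodeInPath: true },
];

// Returns how many AU sites have mi in the URL path, derived only from the two counts.
function auSitesWithLangCode(sites) {
  const withoutLangCode = sites.filter(
    s => s.urlContainsLangCodeInPath === false
  ).length;
  const withAuException = sites.filter(
    s => s.geoLocationCountryCode === "AU" || s.urlContainsLangCodeInPath === false
  ).length;
  return withAuException - withoutLangCode;
}

// Equal counts: no AU site needs special handling.
const before = auSitesWithLangCode(overseasSample);

// Adding one AU site with mi in its path makes the counts differ by one.
const after = auSitesWithLangCode(overseasSample.concat([
  { domain: "http://d.example", geoLocationCountryCode: "AU", urlContainsLangCodeInPath: true },
]));

console.log(before, after);
```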
46
47
Getting a domain listing of the sites that matched, per country:
db.Websites.aggregate([
    {
        $match: {
            $and: [
                {geoLocationCountryCode: {$ne: "NZ"}},
                {domain: {$not: /\.nz/}},
                {numPagesContainingMRI: {$gt: 0}},
                {urlContainsLangCodeInPath: false}
            ]
        }
    },
    { $unwind: "$geoLocationCountryCode" },
    {
        $group: {
            _id: {$toLower: '$geoLocationCountryCode'},
            count: { $sum: 1 },
            domain: { $addToSet: '$domain' }
        }
    },
    { $sort : { count : -1} }
]);
70
71 /* 1 */
72 {
73 "_id" : "us",
74 "count" : 120.0,
75 "domain" : [
76 "http://lianzaconference2012.blogspot.com",
77 "https://www.pinterest.ca",
78 "http://takethatvacation.com",
79 "https://www.indexmundi.com",
80 "http://ngarangatahi.tripod.com",
81 "http://frontrowphotos.com",
82 "https://www.nccri.ie",
83 "http://niken8media.logdown.com",
84 "https://www.seapixonline.com",
85 "https://www.code-postal.com",
86 "http://www.muhammad.com",
87 "https://static-promote.weebly.com",
88 "http://www.unicode.org",
89 "http://anglicanhistory.org",
90 "http://rangiwewehi.com",
91 "https://wol.jw.org",
92 "http://www.pressreader.com",
93 "http://linkvip.top",
94 "https://www.podrozeady.com",
95 "http://www.thesalmons.org",
96 "http://shangrilapress.net",
97 "http://georgegi.tripod.com",
98 "https://www.terakau.org",
99 "http://svenskadress.net",
100 "http://malecek.com",
101 "http://word-dialect.blogspot.com",
102 "https://www.blue-frontiers.com",
103 "http://atopeconlostopes.blogspot.com",
104 "http://dannykahei.tripod.com",
105 "https://www.oemsec.com",
106 "http://wikiedit.org",
107 "https://www.dbnames.net",
108 "http://www.godrules.net",
109 "http://www.huapala.org",
110 "https://www.pinterest.jp",
111 "https://kjohnsonnz.blogspot.com",
112 "http://www.gotquestions.org",
113 "http://tuhua2010.blogspot.com",
114 "http://www.twttoa.com",
115 "http://pumanawawhangara.blogspot.com",
116 "http://hannas-reiseblog.blogspot.com",
117 "https://nl.pinterest.com",
118 "https://www.myadsclassified.com",
119 "http://mikebonnice.com",
120 "https://www.webwiki.com",
121 "http://fhr.kiwicelts.com",
122 "https://articles.imperialtometric.com",
123 "http://kiaorahola.blogspot.com",
124 "http://ww25.milfsplease.com",
125 "http://daandehn.com",
126 "http://www.precious-testimonies.com",
127 "https://www.pinterest.it",
128 "https://www.pinterest.co.uk",
129 "http://naturalfatburner.net",
130 "https://www.vaihaunui.net",
131 "http://capsuraotearoa.blogspot.com",
132 "http://m.biblepub.com",
133 "http://shuttersportnelson.photoshelter.com",
134 "http://precious-testimonies.com",
135 "http://wowwars.net",
136 "https://www.breaker.audio",
137 "http://tkrow.tripod.com",
138 "http://ritusehji.blogspot.com",
139 "http://seapixonline.com",
140 "http://www.whoisthatr.com",
141 "https://livestream.com",
142 "https://biblehub.com",
143 "https://www.pipirikiapapatuanuku.org",
144 "http://www.wikitree.com",
145 "http://bahaiprayers.net",
146 "https://phet.colorado.edu",
147 "http://tatai09.blogspot.com",
148 "http://www.hudl.com",
149 "https://ebible.org",
150 "http://rhymebrain.com",
151 "http://tkkpipipaopao.blogspot.com",
152 "http://www.waimate.com",
153 "http://piripi.blogspot.com",
154 "http://burkekm001.tripod.com",
155 "https://www.hidroponia.org.mx",
156 "http://www.v3whois.com",
157 "http://www.the-naked.com",
158 "https://www.pinterest.fr",
159 "http://maaori.com",
160 "http://loquevendra318.com",
161 "http://www.geni.com",
162 "https://maorinews.com",
163 "http://www.frogsonline.com",
164 "https://drive.google.com",
165 "https://in.pinterest.com",
166 "http://www.mkiwi.com",
167 "https://www.kaifineart.com",
168 "http://www.roadsmile.com",
169 "https://png.bible",
170 "http://blogdepasopor.blogspot.com",
171 "http://www.steve-wheeler.co.uk",
172 "http://www.whoisentry.com",
173 "http://anglican.org",
174 "http://www.eyecontactsite.com",
175 "http://aclhokiangarocks.blogspot.com",
176 "http://manateina.blogspot.com",
177 "https://www.knowatom.com",
178 "https://chromium.googlesource.com",
179 "https://za.pinterest.com",
180 "http://mahoraroom8.blogspot.com",
181 "https://www.bible.com",
182 "http://worldradiomap.com",
183 "http://www.hiroa.pf",
184 "http://www.lunar-occultations.com",
185 "https://docs.google.com",
186 "http://www.krassotkin.ru",
187 "http://www.namesdir.com",
188 "https://www.poehalisnami.ua",
189 "http://www.forensicfashion.com",
190 "http://eartheum.com",
191 "http://www.code-postal.com",
192 "http://mrshamiltonskoolkidz.blogspot.com",
193 "https://www.natekore2018.com",
194 "http://korora.econ.yale.edu"
195 ]
196 }
197
198 /* 2 */
199 {
200 "_id" : "de",
201 "count" : 19.0,
202 "domain" : [
203 "http://vulkane.ch",
204 "http://www.stephe.de",
205 "https://ersatzteile-fachversand.de",
206 "http://etoile-de-lune.net",
207 "https://www.cartogiraffe.com",
208 "https://laskar02cinta.page.tl",
209 "http://www.cartogiraffe.com",
210 "http://www.udhr.de",
211 "http://klaaskoehne.de",
212 "http://m.distanta.1km.net",
213 "http://insecta.pro",
214 "http://weltderberge.de",
215 "http://arts.mythologica.fr",
216 "http://www.behlig.de",
217 "http://svenkirsten.com",
218 "http://etymologie.info",
219 "http://www.nierstrasz.org",
220 "https://www.tvteile.de",
221 "https://www.you-fly.com"
222 ]
223 }
224
225 /* 3 */
226 {
227 "_id" : "fr",
228 "count" : 16.0,
229 "domain" : [
230 "http://www.gif.ovh",
231 "http://pt.city-usa.net",
232 "http://www.maraamusurfskirace.com",
233 "http://rapanui.fr",
234 "http://kihikihi.fr",
235 "http://blueheavenisland.com",
236 "http://www.gototahiti.net",
237 "http://www.gaudry.be",
238 "http://www.rongo-rongo.com",
239 "http://chantsdeluttes.free.fr",
240 "http://www.blueheavenisland.com",
241 "http://baladeornithologique.com",
242 "http://mahajana.net",
243 "https://www.lexilogos.com",
244 "https://www.manualscat.com",
245 "http://splaf.free.fr"
246 ]
247 }
248
249 /* 4 */
250 {
251 "_id" : "nl",
252 "count" : 16.0,
253 "domain" : [
254 "http://gouvernante.info",
255 "http://www.gouvernante.info",
256 "https://arrowheadproject.azurewebsites.net",
257 "http://hidsonphoto.com",
258 "http://skimap.info",
259 "http://tetsubo.org",
260 "https://arrowhead.eu",
261 "https://www.henrifloor.nl",
262 "http://diverosa.com",
263 "https://www.arrowhead.eu",
264 "http://wearehomework.com",
265 "http://nielsonboutique.co.uk",
266 "http://tonhut.nl",
267 "http://longhornlaw.net",
268 "http://www.nonlinear.demon.nl",
269 "http://www.encyclo.co.uk"
270 ]
271 }
272
273 /* 5 */
274 {
275 "_id" : "dk",
276 "count" : 8.0,
277 "domain" : [
278 "http://jazz.ngapuhitelevision.com",
279 "http://komisch.ngapuhitelevision.com",
280 "http://powhiri.ngapuhitelevision.com",
281 "http://waiatarangatiratanga.ngapuhitelevision.com",
282 "http://ngapuhiradio.com",
283 "http://ngapuhitelevision.com",
284 "http://www.rennertweb.de",
285 "http://akona.ngapuhitelevision.com"
286 ]
287 }
288
289 /* 6 */
290 {
291 "_id" : "cz",
292 "count" : 5.0,
293 "domain" : [
294 "https://www.fipojobs.com",
295 "http://about.ilikeyou.com",
296 "https://www.viveipcl.com",
297 "http://www.henryklahola.nazory.cz",
298 "http://henryklahola.nazory.cz"
299 ]
300 }
301
302 /* 7 */
303 {
304 "_id" : "ca",
305 "count" : 5.0,
306 "domain" : [
307 "http://www.myrasplace.net",
308 "http://bcmarina.com",
309 "http://bckayak.com",
310 "http://aguadilla.airport-authority.com",
311 "http://00.gs"
312 ]
313 }
314
315 /* 8 */
316 {
317 "_id" : "gb",
318 "count" : 4.0,
319 "domain" : [
320 "http://www.woolrych.org",
321 "https://omniatlas.com",
322 "http://www.wordsearchfun.com",
323 "http://mikestephens.co.uk"
324 ]
325 }
326
327 /* 9 */
328 {
329 "_id" : "es",
330 "count" : 4.0,
331 "domain" : [
332 "https://www.uv.es",
333 "http://www.cruceros-princess.mx",
334 "https://www.reclamaciondevuelos.com",
335 "http://www.info-hoteles.com"
336 ]
337 }
338
339 /* 10 */
340 {
341 "_id" : "au",
342 "count" : 4.0,
343 "domain" : [
344 "https://koreromaori.com",
345 "http://theunderwaterworld.com",
346 "https://infogram.com",
347 "http://fionajack.net"
348 ]
349 }
350
351 /* 11 */
352 {
353 "_id" : "it",
354 "count" : 3.0,
355 "domain" : [
356 "http://oipaz.net",
357 "http://www.marcosanti.it",
358 "http://www.pegasoesmicamion.com"
359 ]
360 }
361
362 /* 12 */
363 {
364 "_id" : "at",
365 "count" : 3.0,
366 "domain" : [
367 "http://www.petit-prince.at",
368 "http://petit-prince.at",
369 "http://www.tmtmm.net"
370 ]
371 }
372
373 /* 13 */
374 {
375 "_id" : "ch",
376 "count" : 2.0,
377 "domain" : [
378 "https://photos.axelebert.org",
379 "https://nicoledidi.ch"
380 ]
381 }
382
383 /* 14 */
384 {
385 "_id" : "ro",
386 "count" : 2.0,
387 "domain" : [
388 "http://parohiauceadesus.ro",
389 "http://www.parohiauceadesus.ro"
390 ]
391 }
392
393 /* 15 */
394 {
395 "_id" : "unknown",
396 "count" : 1.0,
397 "domain" : [
398 "https://www.hitiaotera.com"
399 ]
400 }
401
402 /* 16 */
403 {
404 "_id" : "fi",
405 "count" : 1.0,
406 "domain" : [
407 "http://pertti.com"
408 ]
409 }
410
411 /* 17 */
412 {
413 "_id" : "jp",
414 "count" : 1.0,
415 "domain" : [
416 "http://yutaka.it-n.jp"
417 ]
418 }
419
420 /* 18 */
421 {
422 "_id" : "mx",
423 "count" : 1.0,
424 "domain" : [
425 "http://www.gelbukh.com"
426 ]
427 }
428
429 /* 19 */
430 {
431 "_id" : "ru",
432 "count" : 1.0,
433 "domain" : [
434 "https://www.gismeteo.lv"
435 ]
436 }
437
438 /* 20 */
439 {
440 "_id" : "bg",
441 "count" : 1.0,
442 "domain" : [
443 "http://anitra.net"
444 ]
445 }
446
447 /* 21 */
448 {
449 "_id" : "ie",
450 "count" : 1.0,
451 "domain" : [
452 "https://coggle.it"
453 ]
454 }
455
456 /* 22 */
457 {
458 "_id" : "cn",
459 "count" : 1.0,
460 "domain" : [
461 "http://kiwi2china.com"
462 ]
463 }
464
465 /* 23 */
466 {
467 "_id" : "ir",
468 "count" : 1.0,
469 "domain" : [
470 "https://www.dideo.ir"
471 ]
472 }
473
474 /* 24 */
475 {
476 "_id" : "il",
477 "count" : 1.0,
478 "domain" : [
479 "http://www.daat.ac.il"
480 ]
481 }
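
The per-country listing above comes from the $group stage: lower-case the country code, count one per site, and collect the distinct domains. A minimal in-memory sketch of that grouping, assuming records shaped like the Websites documents (the function name groupByCountry is made up for illustration, and the three sample records reuse domains from the output above):

```javascript
// Groups site records by lower-cased country code, like the $group stage above.
function groupByCountry(sites) {
  const groups = new Map();
  for (const site of sites) {
    const id = String(site.geoLocationCountryCode).toLowerCase(); // $toLower
    const group = groups.get(id) || { _id: id, count: 0, domain: new Set() };
    group.count += 1;              // $sum: 1
    group.domain.add(site.domain); // $addToSet: a Set keeps domains distinct
    groups.set(id, group);
  }
  return [...groups.values()]
    .map(g => ({ _id: g._id, count: g.count, domain: [...g.domain] }))
    .sort((a, b) => b.count - a.count); // $sort: { count: -1 }
}

const grouped = groupByCountry([
  { domain: "https://koreromaori.com", geoLocationCountryCode: "AU" },
  { domain: "http://pertti.com",       geoLocationCountryCode: "FI" },
  { domain: "http://fionajack.net",    geoLocationCountryCode: "AU" },
]);

console.log(JSON.stringify(grouped, null, 2));
```

The order of groups with equal counts here follows insertion order; the order MongoDB returns for ties is not something these notes establish.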
482
483
484
Can inspect a website's pages to judge whether the site is relevant or auto-translated, as follows:
    db.getCollection('Webpages').find({URL:/svenkirsten.com/, mriSentenceCount: {$gt: 0}})
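
The page lookup above can be sketched in plain JavaScript as a filter on URL and mriSentenceCount. The helper names and sample pages are hypothetical (the page paths are invented); the sketch escapes the domain before building the regex, so that the dot matches only a literal dot, which the query as written does not do.

```javascript
// Escape regex metacharacters so "." in a domain matches only a literal dot.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Keeps pages whose URL contains the domain and that have at least one MRI sentence.
function pagesWithMRI(pages, domain) {
  const pattern = new RegExp(escapeRegExp(domain));
  return pages.filter(p => pattern.test(p.URL) && p.mriSentenceCount > 0);
}

// Hypothetical page records; only the field names come from the query above.
const samplePages = [
  { URL: "http://svenkirsten.com/a.html",         mriSentenceCount: 1 },
  { URL: "http://svenkirsten.com/b.html",         mriSentenceCount: 0 },
  { URL: "http://svenkirstenXcom.example/c.html", mriSentenceCount: 2 },
];

const matches = pagesWithMRI(samplePages, "svenkirsten.com");
console.log(matches.map(p => p.URL));
```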
487
488
489* CN: Only 1/113 sites from CN stood out as being of interest: http://kiwi2china.com/
490 BUT: it's auto-translated (e.g. Dutch is clearly auto-translated), MRI not in default or any visible drop down list, and the domain changes once you view it in Dutch to https://nl.admission.nz/
491
492* FR: 16 sites from FR
493 http://blueheavenisland.com, http://www.blueheavenisland.com - misdetection. French Polynesia
494 https://www.lexilogos.com/ -> takes me to NZ website MaoriDictionary.co.nz etc for translating words anyway
495 http://kihikihi.fr/ -> travel (blog?). Appears to be Hawaiian related and not Maori.
496!! http://chantsdeluttes.free.fr/versionsinter/page%20maori.html -> Seems it may be a proper translation or composition, as Dutch and Flemish (and Groningense) versions are different songs by individual translators/composers
497 http://splaf.free.fr/pfurb.html - Tahiti, French Polynesian, ... island names
498X http://mi.fitnessrebates.com - Uses https://wordpress.org/plugins/weglot/ wordpress-compatible multilingual plugin, which ensures translated pages get indexed by google - exactly what we want to avoid
499 http://mahajana.net - misdetected a Japanese Zen Buddhist chant as MRI
500 http://rapanui.fr - Rapa Nui Easter Island. Misdetected.
501 http://www.gif.ovh - autotranslated pages. Supposedly a GIF repository
502 http://baladeornithologique.com - misdetection of the word "Retour"
503 http://www.gaudry.be - misdetection of Japanese hiragana etc, and French "faire", as MRI
504 http://www.gototahiti.net - probably misdetection, see title
505 http://www.maraamusurfskirace.com - Bora Bora, French Polynesia. Misdetected.
506 http://www.rongo-rongo.com - appears to be related to Easter Island. Just 1 sentence however.
507 http://pt.city-usa.net - misdetection. Hawaii.
508 https://www.manualscat.com - Misdetection. Appears to be in German. Manuals pages.
509NL:
510(!!!) - http://www.gouvernante.info and http://gouvernante.info - radio links to NZ websites not found by commoncrawl and which potentially have Maori language content. For example, http://irirangi.net/, https://www.atiawatoafm.com, www.maori.org.nz [http://www.gouvernante.info/radio4.htm]
511- https://www.arrowhead.eu, https://arrowheadproject.azurewebsites.net, arrowhead.eu - misidentification of URL
- tonhut.nl - misidentification
? http://nielsonboutique.co.uk, http://longhornlaw.net, http://tetsubo.org, http://hidsonphoto.com, http://wearehomework.com/ - Feels autotranslated, but no language options visible. All SEO related
514- diverosa.com - Rapa Nui, Easter Island
515- nonlinear.demon.nl - misidentified
516- encyclo.co.uk - misidentification
517- henrifloor.nl - misidentification
518- http://skimap.info/ - maps, NZ placenames in PDF
519DK:
520!! ++ http://akona.ngapuhitelevision.com, http://waiatarangatiratanga.ngapuhitelevision.com,
521http://jazz.ngapuhitelevision.com, http://ngapuhitelevision.com, http://ngapuhiradio.com,
522http://powhiri.ngapuhitelevision.com, http://komisch.ngapuhitelevision.com
523- http://www.rennertweb.de - a photogallery page mentioning NZ placenames
524CA:
525- http://bcmarina.com AND http://bckayak.com - photos with Canadian placenames
- http://www.myrasplace.net - pages of photos, captions involving NZ placenames
527~ http://00.gs/Maniapoto;Uriwera;Moriori;Hivaoa;Kumulipo.htm - Maori-Polynesian comparative dictionary words listing
528- aguadilla.airport-authority.com - misidentification
529[MOVED TO US: - https://articles.imperialtometric.com - misidentification]
530[MOVED TO US: - http://daandehn.com - no more than 1 sentence over multiple files. Appears to be photo captions of NZ placenames]
531DE:
532- http://etymologie.info/~e/n_/nz-___reg.html - placenames, not meaningful
!! https://www.cartogiraffe.com/ and http://www.cartogiraffe.com - some genuine pages (Rarotongan), but one page is in Czech and had a single word misidentified as MRI
534~ http://svenkirsten.com/ - one page mentions "tiki" but the rest is in English. The other is an (English) caption of "Book of Tiki A Maori Maiden"
535- herocity - autotranslated
536- weltderberge.de - 3 pages mention NZ mountains by name.
537~ (arts.mythologica.fr) https://mythologica.fr/oceanie/texte/pantheon_polynesien.pdf - mentions certain Maori Gods and other Polynesian Gods by name.
538- https://traynews.com - nothing in MRI, misdetected
539~ http://klaaskoehne.de/galleries/nzl-taranaki/index.html - mentions NZ mountain names
540- http://www.nierstrasz.org/deGrauwRegister.rtf - misdetected European (Dutch) names as MRI
541X https://afrikhepri.org/mi/ - autotranslated
542- https://www.tvteile.de - pure German pages, misdetected "Automatik" as a Maori language word
543- etoile-de-lune.net - 5 pages containing 1 sentence each but none with 2 sentences detected
544- https://www.you-fly.com - misdetection of German "Warum?" as MRI
545- http://vulkane.ch - misdetected pages on Hawaiian volcanoes.
546- http://www.stephe.de - photos from NZ captioned with NZ placenames
547- http://insecta.pro - misdetection
548- http://m.distanta.1km.net - NZ placenames. Lots of distances mentioning Waitangi. Nothing detected as containing more than 1 sentence.
549- https://ersatzteile-fachversand.de - German misdetected as Maori.
550- https://laskar02cinta.page.tl/Info.htm - seems like a junk site with a random sentence autotranslated into many different languages. So one sentence possibly in Maori, but may not make sense.
551- http://www.behlig.de - misdetection. Photos from Hawaii.
552!! http://www.udhr.de - Universal Declaration of Human Rights. (Also on a Bulgarian site). Multiple translations available.
553- ITALY:
554 http://oipaz.net/IMG/GalleriaAotearoa/ - NZ photogallery with each photo captioned by placename
555 http://www.marcosanti.it/Reportage/Oceania_ph/Nuova_Zelanda/ - each photo captioned by NZ placename
556 http://www.pegasoesmicamion.com/ - REO abbreviation misidentified, also in REO%20PUBLICIDAD.htm
557- AUSTRIA:
558 petit-prince.at - Tahitian and Wayuu (Venezuela) translations of Le Petit Prince
559 http://www.tmtmm.net/newzealand - photos from NZ named after places and people's names
560- ROMANIA: parohiauceadesus.ro - Sentences of single Romanian words misidentified.
561- ISRAEL:
562 http://www.daat.ac.il - misidentification of "no." as MRI, and Hebrew words.
    [MOVED TO UNKNOWN: https://www.hitiaotera.com/ - misidentification of Tahitian pages]
564- RUSSIA: https://www.gismeteo.lv - misidentification of an email address
565- JAPAN: http://yutaka.it-n.jp - many pages of scientific names of (plants?) which are often misdetected as MRI
566!! - Ireland, ie: https://coggle.it
567- IRAN: https://www.dideo.ir/v/yt/d6cgya0ze-E - video title from MaoriTelevision website
568- CZECH republic:
569? https://www.fipojobs.com/new-zealand/jobs-work-p-1 - NZ job position title in MRI but rest in English
570!! http://www.henryklahola.nazory.cz/094.Maori.htm and http://henryklahola.nazory.cz variant
571 http://about.ilikeyou.com - dating site. Misidentification.
    [GAINED FROM UNKNOWN: https://www.viveipcl.com: tours website, placenames mentioned]
573- SPAIN:
574!! https://www.uv.es/~pla/red.net/intmaori.html
575 https://www.reclamaciondevuelos.com - 2 occurrences of the word "kiwi"
576 http://www.info-hoteles.com/nz/2/hotels_lake_rotoiti.asp - 2 uses of the same placename
577 http://www.cruceros-princess.mx/princessMX/Oferta_Cruzeiros_Polinesia.html - Polynesian placenames
578- SINGAPORE: https://omg-solutions.com - autotranslated
579- TURKEY: https://www.elitedeluxe.com.tr/mi/yatak-odasi-takimlari - autotranslated
580- MEXICO: http://www.gelbukh.com - misidentification, lines of just numbers or phrases like "Area Chair" in English and Russian CVs.
581- FINLAND: http://pertti.com - travelogue, placenames
582- SWITZERLAND CH:
583 nicoledidi.ch - blog, placenames
584 https://photos.axelebert.org - Tahiti related content
585- UNKNOWN:
586[MOVED TO CZ: https://www.viveipcl.com: tours website, placenames mentioned]
GAINED FROM IL: https://www.hitiaotera.com/ - misidentification of Tahitian pages
588#- EU: https://www.the-good-stuff-factory.be/mi/ : Autotranslated
589!! - BULGARIA: http://anitra.net/activism/humanrights/UDHR/rrt_print.htm (2 pages)
590
591
592AUSTRALIA:
593[MOVED TO US: !! https://www.kiwiproperty.com - e.g. https://www.kiwiproperty.com/the-base/mi/he-paepaki/ has some actual MRI sentences. [Not autotranslated]]
594? http://fionajack.net - Wellington gallery of artist. A few occurrences of Kia Ora in a title like context (e.g. "Street Party Kia Ora! Kia Ora!")
595X!! https://infogram.com/te-marautanga-o-aotearoa-moe-pld-allocations-2012-1go502ygvn562jd - site of individual pages (like docs.google.com). This one has a relevant infogram image. But it's English with MRI in the image legend and captions.
596!! https://koreromaori.com - some actual Maori language sentences
597 http://theunderwaterworld.com/Galleries/Roimata/roimata-frame.html - placenames
598
599UK:
600 http://www.wordsearchfun.com/200628_Word_Find_wordsearch.html - 2 word games with Maori words (one of them has 3 different views, e.g. print view)
601? https://omniatlas.com/maps/australasia/18400206/plain/ - historical map with Maori iwi names over NZ map regions
602? https://omniatlas.com/maps/australasia/18400206/ - historical map of Australia and NZ at the time of the Treaty of Waitangi, with events marked in English
603 https://centrallanguageschool.com - AUTOTRANSLATED
604 https://www.solasolv.com - Autotranslated product site
605 http://mikestephens.co.uk/ - photo captions containing NZ placenames
606 http://www.woolrych.org/nzholiday2004/ - photogallery captioned with NZ placenames
607
608
609US:
610Done: manually inspected 69/120 sites
611
612TOTAL US: 1+4+7+7+4+3=26
613
614
615US GAINED AFTER REINGEST:
616+ anglican.org
617GAINED FROM CA: - https://articles.imperialtometric.com - misidentification
618GAINED FROM CA: - http://daandehn.com - no more than 1 sentence over multiple files. Appears to be photo captions of NZ placenames
619
620DEFINITELY:
621+ http://anglicanhistory.org,
622+ http://www.unicode.org, [Universal declaration of Human Rights]
623+ https://static-promote.weebly.com,
624+ http://aclhokiangarocks.blogspot.com, [often English, but COMMUNITY. At least short or partial MRI sentences.]
625
626BIBLE/MOHAMMED/BAHAI TRANSLATIONS probably not auto translations:
627+ http://bahaiprayers.net, [Dutch seems to be properly translated, not auto-translated, so maybe MRI too]
628+ https://biblehub.com,
629+ http://www.muhammad.com, [possibly not autotranslated]
630+ http://www.godrules.net, [possibly not autotranslated]
631+ http://m.biblepub.com,
632+ http://www.krassotkin.ru, [probably real translations, as there are multiple Dutch translations from different sources provided]
633+ http://www.gotquestions.org, [doesn't appear autotranslated]
634X https://ebible.org, [Hiri Motu, PNG language misdetected. Doesn't seem to have Maori]
635X https://www.bible.com, doesn't have Maori translation. Misdetected.
636X https://wol.jw.org, - doesn't have Maori translations. Instead, Rongo-rongo, Kiribati (Micronesian) etc misdetected
637X https://png.bible, [misdetected, Papua New Guinea]
638X http://www.precious-testimonies.com, http://precious-testimonies.com/JesusDidItTranslations/JesusDidItMaoriTranslation.htm may be autotranslated as the Dutch page looks more like Danish or some Scandinavian language and the French page is missing accented characters.
639
640CHECK, PROBABLY - PROCESSED:
641!! https://maorinews.com,
642!! http://maaori.com,
643!!+ http://kiaorahola.blogspot.com,
644+ https://kjohnsonnz.blogspot.com,
645+ http://pumanawawhangara.blogspot.com,
646+ http://dannykahei.tripod.com,
647+ http://burkekm001.tripod.com,
648+ http://tkkpipipaopao.blogspot.com,
649+ http://manateina.blogspot.com,
650? tkkpipipaopao.blogspot.com? http://rangiwewehi.com, [English, but community]
651? https://www.terakau.org, [COMMUNITY, but English]
652? https://www.pipirikiapapatuanuku.org, [COMMUNITY?, in English, environment site]
653~ http://georgegi.tripod.com,
654~ http://ngarangatahi.tripod.com, [1 page, image caption, Maori language warden position title with English sentence for appointment as warden]
655X http://fhr.kiwicelts.com,
656X http://tkrow.tripod.com, [English, background of NZ place]
657X http://www.mkiwi.com, - placenames
658X http://www.waimate.com, [English, NZ place]
659
660MAYBE, INSPECT - PROCESSED:
661? https://www.natekore2018.com, [lots of English, but COMMUNITY, CULTURE]
662+ http://tatai09.blogspot.com,
663+ http://www.twttoa.com,
664+ http://tuhua2010.blogspot.com,
665X http://www.huapala.org, [misdetected, Hawaiian]
666X https://www.vaihaunui.net, [misdetected, Tahiti]
667X https://www.kaifineart.com, [art site by different artists. A Chinese and another (possibly Japanese) name were misdetected]
X http://mahoraroom8.blogspot.com, [NZ school, but main page mostly in English. No pages with > 1 sentence detected as MRI]
669+ http://piripi.blogspot.com,
670X http://www.hiroa.pf, [misdetected. Crawled content appears Polynesian not Maori]
671X http://korora.econ.yale.edu, [NZ place photo caption]
672X https://www.poehalisnami.ua, [mostly Cyrillic, with some NZ or Polynesian names misdetected]
673X http://hannas-reiseblog.blogspot.com - one page contained NZ placenames, another had a word misdetected
674
675
676+ https://www.breaker.audio, [audio, with occasional English.]
677? https://livestream.com, [video and audio, seems in English, but maybe CULTURAL/COMMUNITY?]
678
679X https://docs.google.com, timetable with occasional Maori language word
680+ https://drive.google.com, https://drive.google.com/file/d/1NwuzafjddaP8gxI7O_Zapts5bM7mrtwn/preview is an image of Maori number names. But other page on drive.google.com is a NZ certificate or ID (in English) of a person's position.
681http://ritusehji.blogspot.com - no page with more than 1 sentence detected. But short string of actual MRI content. Educator blog with pictures and English language content.
682
683
684PINTEREST
685+ https://in.pinterest.com/pin/317363104978423418/
686 "karakia mo te moana - Google Search | Te Reo Maori Resources | Moana, Powerpoint tips, Google"
687? https://za.pinterest.com/pin/524669425310419500/
688 Maori Moko | Image | Moko Maori Tattoo & Portraits | TA MOKO | Maori tribe, Maori people, Maori art [COMMUNITY, CULTURE]
689[The other pinterest detected as numPagesContainingMRI > 0 was misdetected]
690
691https://nl.pinterest.com,
692https://www.pinterest.jp,
693https://www.pinterest.it,
694https://www.pinterest.co.uk,
695https://www.pinterest.ca,
696https://za.pinterest.com,
697https://www.pinterest.fr,
698https://in.pinterest.com,
699
700MORE BLOGSPOTS
701X http://word-dialect.blogspot.com, [Indonesian, misdetected]
702~ http://atopeconlostopes.blogspot.com, [title on page appears to be in MRI, but content appears to be in English and South/Central American. Internationally focussed content.]
703X http://lianzaconference2012.blogspot.com, [NZ placename or institution]
704? http://mrshamiltonskoolkidz.blogspot.com, [te reo Maori related school activities. Described in English.]
705X http://capsuraotearoa.blogspot.com, [blog in French, photo captions contain NZ placenames]
706X http://blogdepasopor.blogspot.com, [blog in French, Rapa Nui/Easter Island related content, misdetected.]
707
708
709UNLIKELY
710?? http://naturalfatburner.net, http://naturalfatburner.net/NoNonsenseTed/fatloss-mao/ feels like it's autotranslated, an image of text appears, but the text is in MRI [advertising for some weight loss gimmick]
711
712
713BLACKLIST:
714X http://ww25.milfsplease.com,
715X http://www.the-naked.com
716
717OTHER:
718X http://seapixonline.com, https://www.seapixonline.com, [photo captions of ships. Sometimes misdetected Japanese words as MRI.]
719X http://www.code-postal.com, https://www.code-postal.com, [not more than 1 sentence detected as in MRI]
720X https://www.dbnames.net, [Name database, lots misdetected]
721
722STILL TO DO LIST - PROCESSED:
723
724X https://www.myadsclassified.com, [misdetected 3 short English sentences as MRI]
725X http://www.whoisthatr.com, [misdetected short English sentence as MRI]
726X https://www.oemsec.com, [autotranslated product site]
727X http://svenskadress.net, [linkfarm like site of related junk links, contained URLs misdetected as MRI]
728
729X https://www.webwiki.com, [contains URLs. URLs containing Aotearoa as substring detected as MRI. But no proper sentence content. ]
730X http://mikebonnice.com, [Hawaiian and Tahiti related content misdetected]
731X http://www.hudl.com, [misdetected short English sentence as MRI]
732X http://www.wikitree.com, [misdetected short English sentence as MRI]
733X http://shuttersportnelson.photoshelter.com, [image captions of "Wairua Warrior"]
734
735X http://niken8media.logdown.com, [Poker website? Looks autotranslated or Lorem Ipsum type of meaningless sentences.]
736X https://www.podrozeady.com, Looks Polish or other East-European language. The NZ page https://www.podrozeady.com/NZ/4/ had placenames detected.
737
738X http://www.thesalmons.org, [detection and misdetection of author names of papers hosted]
739
740X http://linkvip.top, [.rar and media file links misdetected as MRI]
741
742
743X http://www.lunar-occultations.com, [NZ place names for astronomical phenomena]
744X http://shangrilapress.net, [NZ placenames]
745X http://malecek.com, [misdetection CD title]
746X https://www.blue-frontiers.com, [Tahitian, Reo Tahiti, misdetected as MRI]
747X http://www.whoisentry.com, [URL names, looked at several which were probably misdetected as MRI]
748X http://loquevendra318.com, [uses Google translate for auto-translation]
749
750
751?? http://www.forensicfashion.com, [historical information, useful for CULTURE? e.g. http://www.forensicfashion.com/1807MaoriChief.html]
752
753X http://www.eyecontactsite.com, [Lots of names. And a few short sentences or words possibly in comments.]
754X http://eartheum.com, [Rapa Nui, Easter Island related content. Misdetected]
755X http://www.steve-wheeler.co.uk, [Blogspot. Title of a single page is in Maori. "Aotearoa ... kei te aroha au ki a koe"]
756X https://chromium.googlesource.com, [some source code related to languages' two letter codes]
757
758X http://www.roadsmile.com, [Lots of misdetection based on word Kia.]
759?? https://www.knowatom.com, https://phet.colorado.edu [Similar looking science web sites for children. Uses auto-translation?]
760
761X https://www.indexmundi.com, [place names. Pages about Solomon Islands. Misdetection of placenames.]
762
763
764
765X http://wowwars.net, [Has a page on Kia Kaha meaning, but URL redirects to a different low quality site with bad formatting and adverts. ]
766?? https://www.hidroponia.org.mx, [Not sure if https://www.hidroponia.org.mx/index.php/idiomas/284-hydroponics-te-ahurea-wai-maori is autotranslated or not. Can't easily locate existence of Dutch or German translated pages. There's Tamil-Singapore, but no other Tamil. So maybe translations based on target buyer audience?]
767X http://www.v3whois.com, [URLs are misdetected as MRI]
X http://rhymebrain.com, [appears to have misdetected a short phrase of 2 words, Kai Kaia, besides phrase words from other languages]
769
770
X SINGLE SENTENCE DETECTED (NO MORE, AND NOT WHOLE PAGE isMRI):
772 http://frontrowphotos.com,
773 http://www.pressreader.com,
774 https://www.nccri.ie,
775 http://takethatvacation.com,
776 http://worldradiomap.com,
777 http://www.namesdir.com,
778
779 X http://www.frogsonline.com, [NZ hotels, placenames]
780 X http://www.geni.com, [Single sentence misdetection]
781 X http://wikiedit.org, [just a list of lots of words, possibly placenames. Some misdetected, e.g. Rapa Nui]
782
783
784
785TOTALS:
786US: 26
787AU: 1
788DE: 2
789DK: 2
790BG: 1
791CZ: 1
792ES: 1
793FR: 1
794IE: 1
TOTAL: 176 (assumed count for NZ) + 36 (sum of the country counts above) = 212
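
The running total above can be re-checked with a few lines of JavaScript. This is only a sketch: the per-country counts are copied by hand from the TOTALS list, and 176 is the assumed NZ figure from the TOTAL line, not a value queried from MongoDB.

```javascript
// Per-country counts of shortlisted overseas sites, copied from the TOTALS list.
const overseasCounts = { US: 26, AU: 1, DE: 2, DK: 2, BG: 1, CZ: 1, ES: 1, FR: 1, IE: 1 };

// Assumed figure for NZ, as stated in the TOTAL line (not re-derived here).
const assumedNzCount = 176;

// Sum the overseas counts, then add the assumed NZ count.
const overseasTotal = Object.values(overseasCounts).reduce((sum, n) => sum + n, 0);
const grandTotal = assumedNzCount + overseasTotal;

console.log(overseasTotal); // 36
console.log(grandTotal);    // 212
```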
796
797------------------------------------------------
2. Need to inspect all sites that have any webPAGE with mi in its URL path (mi.* or */mi) and that neither have a .nz TLD nor originate in NZ:
799
db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}, {domain: {$not: /\.nz$/}}, {geoLocationCountryCode: {$ne: "NZ"}}]}).count()
= 472 websites
802
(vs:
db.getCollection('Websites').find({$and: [{numPagesInMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}, {domain: {$not: /\.nz$/}}, {geoLocationCountryCode: {$ne: "NZ"}}]}).count()
= 209 websites)
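
A note on the .nz exclusion used in these filters: in a regex, an unescaped dot matches any character, so a pattern written as /.nz$/ would also match a domain that merely ends in "nz". The sketch below compares the two regex forms with a hostname-based check. The helper hasNzTld and the made-up domain http://example.xyznz are illustrative only, and the sketch assumes the domain field holds scheme plus host, as in the listings in this file.

```javascript
// Hypothetical helper (not part of the original queries): true when the stored
// domain string, e.g. "https://www.kiwiproperty.com", has a .nz top-level domain.
function hasNzTld(domain) {
  return new URL(domain).hostname.endsWith(".nz");
}

const unescaped = /.nz$/;  // "." matches any single character
const escaped = /\.nz$/;   // "\." matches only a literal dot

// The last entry is a made-up domain that ends in "nz" without a dot before it.
const samples = [
  "http://www.oag.govt.nz",
  "https://www.kiwiproperty.com",
  "http://example.xyznz"
];

for (const d of samples) {
  console.log(d, unescaped.test(d), escaped.test(d), hasNzTld(d));
}
```

For the real domains seen so far both regex forms most likely select the same sites, so the recorded counts should not be affected; the escaped form is simply the safer habit.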
806
807
db.Websites.aggregate([
    {
        $match: {
            $and: [{numPagesContainingMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}, {domain: {$not: /\.nz$/}}, {geoLocationCountryCode: {$ne: "NZ"}}]
        }
    },
    {$group: {_id: "$geoLocationCountryCode", count: {$sum: 1}, domain: { $addToSet: '$domain' }}},
    { $sort : { count : -1} }
]);

(No longer special handling for AU, as we saw earlier.)
819
820/* 1 */
821{
822 "_id" : "US",
823 "count" : 302.0,
824 "domain" : [
825 "https://www.tkthvac.com",
826 "http://mi.gmpmetalwork.com",
827 "http://www.huaxinfurnace.com",
828 "http://mi.tccasdic.com",
829 "https://www.waterproof-factory.com",
830 "http://www.omnicnc.com",
831 "http://www.bdknitting.com",
832 "http://www.prostepper.com",
833 "http://www.tkfanen.com",
834 "http://www.sdcncrouter.com",
835 "http://www.china-brewhouse.com",
836 "http://www.twtvalvecn.com",
837 "http://www.zhengmaoelec.com",
838 "http://www.hqftex.com",
839 "https://mi.centr-zashity.ru",
840 "http://www.szebo.com",
841 "http://www.jointcontrols.net",
842 "http://www.hobbycarbon.com",
843 "https://www.nickel-alloy.net",
844 "http://www.10turntables.com",
845 "https://www.inpnurseryproducts.com",
846 "https://www.risenltd.com",
847 "http://www.ncpcpharma.com",
848 "http://www.weld-automation.com",
849 "http://www.gfh-electric.com",
850 "http://www.tongyujiaju.com",
851 "http://www.nide-industry.com",
852 "http://www.nicehut-window.com",
853 "http://www.acouplefortheroad.com",
854 "https://www.aquagem.com.cn",
855 "https://www.tymexnetting.com",
856 "http://www.ruk-tech.com",
857 "http://www.yrkseal.com",
858 "http://www.ainuogas.com",
859 "http://blicanada.net",
860 "http://www.goethe.de",
861 "https://www.njkeyuda.com",
862 "http://topbitcoincard.com",
863 "http://www.fancyco.com",
864 "http://www.chinagxmy.com",
865 "http://www.cnfeinade.com",
866 "http://www.longxin-global.com",
867 "http://www.nyforgedwheels.com",
868 "http://www.sog-pump.com",
869 "http://www.inpnurseryproducts.com",
870 "http://www.wanmaroto.com",
871 "http://www.yixinhetrade.com",
872 "http://www.b-packaging.com",
873 "http://www.bluekin.com",
874 "http://www.ncpcvet.com",
875 "https://www.glorystarlaser.com",
876 "http://www.shengrunqiche.com",
877 "http://www.wellfit-sportswear.com",
878 "http://csunplugged.org",
879 "https://www.kiwiproperty.com",
880 "http://www.infomutt.com",
881 "http://www.photoprofix.com",
882 "https://drugsinc.eu",
883 "http://www.ttyzfilter.com",
884 "http://www.nicerelay.com",
885 "https://www.gigalight.com",
886 "https://www.sinodryair.com",
887 "http://www.ladybagcn.com",
888 "http://www.cnrgxy.com",
889 "http://www.honglu-mining.com",
890 "http://www.kehengmixing.com",
891 "http://www.cnfreda.com",
892 "http://www.longs-motor.com",
893 "http://www.xzc9.com",
894 "http://www.dmdryer.com",
895 "http://www.ksdoing.com",
896 "http://www.mytrickstips.com",
897 "http://www.focusway-casting.com",
898 "http://www.americasportsfloor.com",
899 "https://cycletraderpro.com",
900 "http://www.chinabosun.com",
901 "https://www.everfineplastics.com",
902 "http://mi.guoguangelectric.com",
903 "http://www.albertnovosino.com",
904 "http://www.evergrowingcage.com",
905 "http://www.seasum.cn",
906 "http://www-hotmail-com.email",
907 "http://www.cnyaonan.com",
908 "http://www.ntvigourbrush.com",
909 "http://www.quickcncmachine.com",
910 "https://www.hengweihoseclamp.com",
911 "http://www.sokenswitch.com",
912 "http://www.soontruepackaging.com",
913 "https://www.rikoooo.com",
914 "http://www.cnxh-electric.com",
915 "http://www.teda-hydraulic.com",
916 "http://www.strongsaw.com",
917 "https://www.prostepper.com",
918 "http://www.pressurelantern.com",
919 "http://www.hs-stationery.com",
920 "http://www.nbbvc.com",
921 "http://lingeriefc.com",
922 "http://www.evaescort.net",
923 "http://www.kd-physicalrehab.com",
924 "http://www.chuamotor.com",
925 "http://cdn.centrallanguageschool.com",
926 "https://worldstarhiphop.roseconverter.com",
927 "https://www.csunplugged.org",
928 "http://www.qypaperbox.com",
929 "https://www.junschem.com",
930 "http://www.gormeet.com",
931 "http://www.szhaiwang.com",
932 "http://www.wzdongyi.com",
933 "http://www.jlgrating.com",
934 "http://www.nantaidiesel.com",
935 "http://www.zhenchengscrew.com",
936 "http://www.accotech.net",
937 "https://atoall.com",
938 "https://mi.wikipedia.org",
939 "https://usahello.org",
940 "http://www.gemnice.com",
941 "http://www.richina-tools.com",
942 "http://www.samewe.net",
943 "http://www.liweimetal.com",
944 "http://www.pxbaisheng.com",
945 "http://www.jiejingfactory.com",
946 "http://www.meihua-wm.com",
947 "http://www.jiajiebathmirror.com",
948 "http://www.touchdisplays-tech.com",
949 "http://www.sdtzgloves.com",
950 "http://www.forever-moving.com",
951 "http://www.cannapresso.com",
952 "http://www.aluminum-profiles-supplier.com",
953 "http://indigenousblogs.com",
954 "http://www.btmeac.com",
955 "http://www.longda-inc.com",
956 "http://www.conele-mixer.com",
957 "http://www.brushcutterjusen.com",
958 "https://mi.m.wikipedia.org",
959 "https://www.judinwire.net",
960 "http://www.toption-ingredients.com",
961 "https://www.fctele.com",
962 "http://www.ledecofr.com",
963 "https://www.drickinstruments.com",
964 "https://policies.oclc.org",
965 "http://www.lanlinprintech.com",
966 "http://www.qjfiberglass.com",
967 "https://www.huadongmedical.com",
968 "http://www.hzhinew.com",
969 "http://www.envicool.net",
970 "http://www.steel-in-china.com",
971 "https://mamaclub.info",
972 "https://www.conele-mixer.com",
973 "https://www.jlextract.com",
974 "http://www.chinaocan.com",
975 "http://www.htwindsolarpower.com",
976 "https://mi.nyecountdown.com",
977 "http://www.gecko-kalimba.com",
978 "https://www.tjshenzhoutong.com",
979 "http://www.vigor-industry.com",
980 "https://maxspeedtest.com",
981 "http://www.sunnymaycn.com",
982 "http://www.tangres100.com",
983 "http://www.bst-elecs.com",
984 "https://www.weld-automation.com",
985 "http://www.suoxuehuwai.com",
986 "http://www.steelprotectionpack.com",
987 "https://twitter.roseconverter.com",
988 "http://mytrickstips.com",
989 "http://binaryoptionsindicators.com",
990 "http://www.jhc-nonwoven.com",
991 "http://www.tjcywires.com",
992 "https://www.wikiplanet.click",
993 "http://infomutt.com",
994 "http://www.nbyobo.com",
995 "http://www.amcbox.com",
996 "http://www.fanhaopets.com",
997 "http://www.supplyfurniture.com",
998 "http://www.ruifeng-leather.com",
999 "https://mi.lawyers.cafe",
1000 "http://www.vango-tech.com",
1001 "http://www.viairdoormat.com",
1002 "https://2fish.co",
1003 "http://atoall.com",
1004 "http://www.qymachines.com",
1005 "https://www.aquark.com.cn",
1006 "http://www.church-of-christ.org",
1007 "http://www.litbright-candles.com",
1008 "https://www.nbwinwinea.com",
1009 "https://www.bestpvcfence.com",
1010 "http://www.chinachairtable.com",
1011 "http://www.zhonghe222.com",
1012 "http://church-of-christ.org",
1013 "http://www.lishin.cc",
1014 "https://www.webhostingsecretrevealed.net",
1015 "http://www.damiser.com",
1016 "http://www.hzzjair.com",
1017 "http://www.sxceramic.com",
1018 "http://www.fxctool.com",
1019 "http://www.livepro-beauty.com",
1020 "https://www.pldyes.com",
1021 "https://vimeo.roseconverter.com",
1022 "http://www.chinapipemills.com",
1023 "http://www.shanghailangzhiweld.com",
1024 "https://mi.kidspicturedictionary.com",
1025 "http://www.ldsolarpv.com",
1026 "https://www.fxcc.com",
1027 "https://www.kubbamachine.com",
1028 "http://www.linbaymachinery.com",
1029 "https://www.axnewdisplay.com",
1030 "http://www.whties.com",
1031 "http://www.homey-tec.com",
1032 "http://www.arjextrailerparts.com",
1033 "http://www.julongjewelry.cn",
1034 "https://www.livehoster.com",
1035 "http://www.risepipe.com",
1036 "http://www.wrdtubemill.com",
1037 "http://www.sunshinebelt.com",
1038 "https://www.yourcloudlibrary.com",
1039 "http://loginmail.online",
1040 "http://www.shengxinsport.com",
1041 "http://www.fxpremiere.com",
1042 "https://www.czzhit.com",
1043 "https://www.king-pcb.com",
1044 "http://www.wpcline.com",
1045 "http://portal.smart-project.info",
1046 "http://www.qxmic.com",
1047 "http://www.luluae.com",
1048 "https://www.datemypet.com",
1049 "http://www.gmk-valve.com",
1050 "https://www.sdspraybooth.com",
1051 "http://www.houshenshoes.com",
1052 "http://www.homewin88.com",
1053 "http://www.sdxhhd.com",
1054 "http://www.bmaxmachine.com",
1055 "http://www.bestwaytowhitenteethguide.org",
1056 "http://www.linphos.com",
1057 "http://www.analiabriz.com",
1058 "http://www.joyseaplywood.com",
1059 "http://www.chinatopcnc.com",
1060 "https://blondewebcamgirl.com",
1061 "http://www.czzhit.com",
1062 "https://www.judipak.com",
1063 "http://www.sindadisplay.com",
1064 "http://www.wellformpacking.com",
1065 "http://www.wosaicabinet.com",
1066 "http://www.windsolarchina.com",
1067 "http://www.sinemagnetic.com",
1068 "http://www.ictctruss.com",
1069 "http://www.shshenyong.com",
1070 "http://www.pvcroofingtile.com",
1071 "http://www.mtpak.com",
1072 "http://www.tubemillcn.com",
1073 "http://www.weldpipemill.com",
1074 "http://www.xida-electronics.com",
1075 "http://www.cnsongben.com",
1076 "https://www.nbkeming.com",
1077 "http://www.jpslurrypump.com",
1078 "http://www.cz-juteng.com",
1079 "https://vk.roseconverter.com",
1080 "http://www.sps-squeegee.com",
1081 "http://mi.broadcastbeat.com",
1082 "https://www.td-casting.com",
1083 "http://milfsplease.com",
1084 "http://www.qbd-group.com",
1085 "http://technobuzzer.com",
1086 "https://www.cz-juteng.com",
1087 "http://www.xfinsulation.com",
1088 "http://www.wavesspring.com",
1089 "http://www.bigrollscloth.com",
1090 "http://www.huamachinery.com",
1091 "http://www.restart-industry.com",
1092 "http://www.shenhe-bearing.com",
1093 "http://www.newbaoquan.com",
1094 "https://follow3rs.com",
1095 "https://www.airpullfilter.com",
1096 "http://www.mao-shuo.com",
1097 "http://mi.hongwugas.com",
1098 "http://www.pamaens.com",
1099 "http://www.weddingfurniture.com",
1100 "http://www.mksmartcard.com",
1101 "http://jobdescriptionsample.org",
1102 "http://www.jbpcba.com.cn",
1103 "https://biblia.gospelprime.com.br",
1104 "https://blockchains.io",
1105 "http://www.qitai-adhesive.com",
1106 "http://www.jindunlaobao.com",
1107 "https://jobdescriptionsample.org",
1108 "https://www.samsungwiremesh.com",
1109 "http://www.eternal-friendship.com",
1110 "http://www.rosin-kings.com",
1111 "https://facebook.roseconverter.com",
1112 "https://www.yogemcasting.com",
1113 "http://www.chinacombinerbox.com",
1114 "https://dwsolo.com",
1115 "http://www.autosunsoul.com",
1116 "https://www.hello4x4.com",
1117 "http://www.silicone-odm.com",
1118 "http://www.wf-fastener.com",
1119 "http://www.czldfloor.com",
1120 "http://www.zjnbzy.com",
1121 "http://www.secondhormone.com",
1122 "http://www.artmetalcn.com",
1123 "http://www.ycautoc.com",
1124 "http://www.chinacarbonfibre.com",
1125 "https://guidebooq.com"
1126 ]
1127}
1128
1129/* 2 */
1130{
1131 "_id" : "CN",
1132 "count" : 118.0,
1133 "domain" : [
1134 "https://www.qlart.com",
1135 "https://www.grandstarcn.com",
1136 "https://www.valve-pipe-fitting.com",
1137 "http://www.wedacdisplays.com",
1138 "http://www.goldenlaser.cc",
1139 "https://www.cntfsolar.com",
1140 "http://www.abdindustrial.com",
1141 "http://www.koowheel.com",
1142 "https://www.gaofeng-petro.com",
1143 "https://www.nbhengchen.com",
1144 "http://www.jsbotanics.com",
1145 "https://www.simphoenix.com",
1146 "https://www.bestardoors.com",
1147 "https://www.n2o2gas.com",
1148 "https://www.charmingmetal.com",
1149 "https://www.fc-med.com",
1150 "http://www.focuslasersystems.com",
1151 "https://www.nfyo.com",
1152 "http://www.zypackag.com",
1153 "http://www.kavounautoparts.com",
1154 "https://www.jsjlmachinery.com",
1155 "https://www.tjtgsteel.com",
1156 "https://www.yangrutingtrade.com",
1157 "https://www.c-superun.com",
1158 "https://www.lasonparts.com",
1159 "https://www.special-metal.com",
1160 "https://www.szhtpmart.com",
1161 "https://www.chinarfidcard.com",
1162 "https://www.ez-walk.com",
1163 "https://www.diamante-tech.com",
1164 "https://www.sino-masterbatch.com",
1165 "https://www.medke.com",
1166 "https://www.dm-compressor.com",
1167 "https://www.haitungchem.com",
1168 "http://www.wenwencf.com",
1169 "https://www.peptidejymed.com",
1170 "https://www.slagremoving.com",
1171 "https://www.chinanbdb.com",
1172 "http://www.gmmdjx.com",
1173 "https://www.richest-group.com",
1174 "http://www.world-starter.com",
1175 "http://www.medicohongkong.com",
1176 "http://www.jetwayamenities.com",
1177 "https://www.abdindustrial.com",
1178 "https://www.artiegarden.com",
1179 "https://www.outstandingdm.com",
1180 "https://www.aoxinhvacr.com",
1181 "https://www.safesworld.com",
1182 "https://www.ngyc.com",
1183 "https://www.szradiant.com",
1184 "https://www.3drambery.com",
1185 "https://www.xianglin-plastics.com",
1186 "http://www.cntiescarf.com",
1187 "https://www.aerial-display.com",
1188 "https://www.imposalight.com",
1189 "https://www.pacopower.com",
1190 "http://www.eburn-burner.com",
1191 "https://www.szzhsbag.com",
1192 "https://www.phhydraulic.com",
1193 "https://www.bofanpc.com",
1194 "http://www.comfortebicycle.com",
1195 "http://www.3drambery.com",
1196 "https://www.pakite.com",
1197 "https://www.inductorchina.com",
1198 "https://www.aootan.com",
1199 "https://www.micropreparedslides.com",
1200 "https://www.tianjia-lock.com",
1201 "https://english.taiergroup.com",
1202 "https://www.hytokstech.com",
1203 "http://www.czhengfa.com",
1204 "http://www.ankaicnc.com",
1205 "https://www.nbulboy.com",
1206 "http://www.eudemonbaby.com",
1207 "http://www.coneleqd.com",
1208 "https://www.band-ss.com",
1209 "https://www.coffbrewing.com",
1210 "https://www.km-medicine.com",
1211 "https://www.jy-glass.com",
1212 "https://www.changjia-machinery.com",
1213 "https://www.zengrit.com",
1214 "http://www.prius-automatic.com",
1215 "https://www.sitzonechair.com",
1216 "https://www.goldnard.com",
1217 "https://www.bescatray.com",
1218 "http://www.qjqdvalve.com",
1219 "http://www.yulong-cellulose-cmc.com",
1220 "https://www.sakysteel.com",
1221 "https://www.tianseoffice.com",
1222 "http://www.likvchina.com",
1223 "https://www.sehenda-en.com",
1224 "http://www.nbwellrun.com",
1225 "https://www.painting-machine.com",
1226 "https://www.sdtoplit.com",
1227 "https://www.jewellrylove.com",
1228 "https://www.fibereye2.com",
1229 "https://www.dghk-buffer.com",
1230 "https://www.rykay.com",
1231 "https://www.wecare-life.com",
1232 "https://www.foocles.com",
1233 "http://www.estarspareparts.com",
1234 "https://www.study-mandarin.com",
1235 "https://www.dshprecision.com",
1236 "https://www.jsbotanics.com",
1237 "https://www.zhongxinlighting.com",
1238 "http://www.refinehotelsupply.com",
1239 "http://www.longtopmining.com",
1240 "https://www.insharevape.com",
1241 "https://www.xinyuesteel.com",
1242 "https://www.herbal-ingredients.com",
1243 "http://www.wigglewires.com",
1244 "https://www.bailixin.com",
1245 "https://www.egbadges.com",
1246 "https://www.qdruidetai.com",
1247 "https://www.sjzhgw.com",
1248 "https://www.zjyongqi.com",
1249 "https://www.rswires.com",
1250 "https://www.chinawelken.com",
1251 "https://www.nbjiatong.com"
1252 ]
1253}
1254
1255/* 3 */
1256{
1257 "_id" : "FR",
1258 "count" : 19.0,
1259 "domain" : [
1260 "https://mi.hyperbaric-chamber.com",
1261 "https://mi.mehmetdursun.av.tr",
1262 "https://www.planetkeyboard.com",
1263 "https://mi.mhthread.com",
1264 "https://mi.gem.agency",
1265 "http://mi.outboard-boat-motor-repair.com",
1266 "https://www.slotsltd.com",
1267 "http://www.gpedia.com",
1268 "http://mi.aasraw.com",
1269 "http://mi.fitnessrebates.com",
1270 "https://mi.petrpikora.com",
1271 "https://mi.phcoker.com",
1272 "https://www.casino.uk.com",
1273 "https://mi.hghphuket.com",
1274 "https://mi.apicmo.com",
1275 "https://mi.isearch.de",
1276 "https://www.expresscasino.com",
1277 "https://mi.usa-casino-online.com",
1278 "http://mi.psychicbonus.com"
1279 ]
1280}
1281
1282/* 4 */
1283{
1284 "_id" : "DE",
1285 "count" : 7.0,
1286 "domain" : [
1287 "https://afrikhepri.org",
1288 "https://mi.vessoft.com",
1289 "http://transposh.org",
1290 "https://transposh.org",
1291 "https://www.saper-link-news.com",
1292 "https://herocity.de",
1293 "https://traynews.com"
1294 ]
1295}
1296
1297/* 5 */
1298{
1299 "_id" : "NL",
1300 "count" : 6.0,
1301 "domain" : [
1302 "http://www.cbdolievoordelen.nl",
1303 "https://www.emergency-live.com",
1304 "http://www.martinvrijland.nl",
1305 "https://realtytenerife.com",
1306 "https://www.bitbybitbook.com",
1307 "http://www.spectrumschool.be"
1308 ]
1309}
1310
1311/* 6 */
1312{
1313 "_id" : "UNKNOWN",
1314 "count" : 3.0,
1315 "domain" : [
1316 "https://mi.buyaas.com",
1317 "https://www.hjfoodmachinery.com",
1318 "https://www.desunpump.com"
1319 ]
1320}
1321
1322/* 7 */
1323{
1324 "_id" : "CA",
1325 "count" : 3.0,
1326 "domain" : [
1327 "https://cloudsfeed.com",
1328 "http://dehaut.com",
1329 "http://newsrule.com"
1330 ]
1331}
1332
1333/* 8 */
1334{
1335 "_id" : "UA",
1336 "count" : 2.0,
1337 "domain" : [
1338 "http://ukraine.admission.center",
1339 "http://umsa.admission.center"
1340 ]
1341}
1342
1343/* 9 */
1344{
1345 "_id" : "GB",
1346 "count" : 2.0,
1347 "domain" : [
1348 "https://www.centrallanguageschool.com",
1349 "https://www.solasolv.com"
1350 ]
1351}
1352
1353/* 10 */
1354{
1355 "_id" : "AU",
1356 "count" : 1.0,
1357 "domain" : [
1358 "http://www.almancax.com"
1359 ]
1360}
1361
1362/* 11 */
1363{
1364 "_id" : "SG",
1365 "count" : 1.0,
1366 "domain" : [
1367 "https://omg-solutions.com"
1368 ]
1369}
1370
1371/* 12 */
1372{
1373 "_id" : "EU",
1374 "count" : 1.0,
1375 "domain" : [
1376 "http://www.the-good-stuff-factory.be"
1377 ]
1378}
1379
1380/* 13 */
1381{
1382 "_id" : "RU",
1383 "count" : 1.0,
1384 "domain" : [
1385 "http://www.treningmozga.com"
1386 ]
1387}
1388
1389/* 14 */
1390{
1391 "_id" : "HK",
1392 "count" : 1.0,
1393 "domain" : [
1394 "http://www.allutertech.com"
1395 ]
1396}
1397
1398/* 15 */
1399{
1400 "_id" : "IE",
1401 "count" : 1.0,
1402 "domain" : [
1403 "http://netkiosk.co.uk"
1404 ]
1405}
1406
1407/* 16 */
1408{
1409 "_id" : "TR",
1410 "count" : 1.0,
1411 "domain" : [
1412 "https://www.elitedeluxe.com.tr"
1413 ]
1414}
1415
1416/* 17 */
1417{
1418 "_id" : "JP",
1419 "count" : 1.0,
1420 "domain" : [
1421 "https://forexmania.org"
1422 ]
1423}
1424
1425/* 18 */
1426{
1427 "_id" : "ES",
1428 "count" : 1.0,
1429 "domain" : [
1430 "https://www.torresbus.es"
1431 ]
1432}
1433
1434/* 19 */
1435{
1436 "_id" : "SE",
1437 "count" : 1.0,
1438 "domain" : [
1439 "http://en.wiki.wintoflash.com"
1440 ]
1441}
1442
1443
First, I eyeballed the lists and excluded all obvious product sites, which are automatically translated.
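
As a rough aid to that eyeballing pass, the "mi.* or */mi" criterion can be expressed as a small check. This is a hypothetical re-implementation for illustration: the function name hasMiLangCode is made up here, and the real urlContainsLangCodeInPath field may well be computed differently. It only reports where the mi code sits in a URL and says nothing about whether a page is autotranslated (mi.wikipedia.org is flagged just like the product sites).

```javascript
// Hypothetical re-implementation of the "mi.* or */mi" criterion.
// The real urlContainsLangCodeInPath field may be derived differently.
function hasMiLangCode(url) {
  const { hostname, pathname } = new URL(url);
  const miSubdomain = hostname.startsWith("mi.");           // e.g. mi.example.com
  const miPathSegment = pathname.split("/").includes("mi"); // e.g. /the-base/mi/...
  return miSubdomain || miPathSegment;
}

console.log(hasMiLangCode("https://mi.wikipedia.org"));
console.log(hasMiLangCode("https://www.kiwiproperty.com/the-base/mi/he-paepaki/"));
console.log(hasMiLangCode("http://indigenousblogs.com/feeds/mi.xml"));
```

Note that a file name such as mi.xml is not a path segment equal to "mi", so the last URL is not flagged by this sketch.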
1445
The following sites remain of interest or possible interest, grouped by country of site origin:
1447
1448US:
1449+ GAINED FROM AU: https://www.kiwiproperty.com - e.g. https://www.kiwiproperty.com/the-base/mi/he-paepaki/ has some actual MRI sentences. [Not autotranslated]
1450
1451!! http://indigenousblogs.com [15/18 blogs work] - has one page in Maori (http://indigenousblogs.com/feeds/mi.xml)
1452X https://biblia.gospelprime.com.br - misdetection (containsMRI)
X ?https://follow3rs.com - seems dodgy and possibly autotranslated; even the English is off, with "account" misspelled as "accout"
1454!! https://mi.m.wikipedia.org, https://mi.wikipedia.org
1455X https://usahello.org - autotranslated
X http://church-of-christ.org, http://www.church-of-christ.org - I think autotranslated, because the Dutch page at https://church-of-christ.org/nl/ says "HET kerken van Christus" instead of using the plural article DE
1457X https://www.livehoster.com
1458X http://www.americasportsfloor.com, - product store. Misdetected
1459!! http://csunplugged.org, https://www.csunplugged.org - University of Canterbury NZ and site only available in EN, MI, DE, ES, CN
1460X https://mi.lawyers.cafe - autotranslated
1461 X https://mi.centr-zashity.ru - same as lawyers.cafe above: autotranslated
1462! https://policies.oclc.org - not completely translated. Copyright page, privacy statement and cookie statement pages appear to be in Maori. Not sure if autotranslated since other pages aren't available in MI. Dutch equivalent pages seem human translated.
1463X http://jobdescriptionsample.org - autotranslated
1464X http://mi.broadcastbeat.com - autotranslated product site
1465X http://www.samewe.net - autotranslated product site
1466X https://mi.kidspicturedictionary.com - autotranslated, but MAY BE USEFUL
1467X https://www.rikoooo.com - autotranslated
1468
1469CN: -
1470
1471FR:
1472? https://mi.phcoker.com - product site "Shangke Chemical Rapu + 86 (1812) 4514114 [email protected]"
1473X http://www.gpedia.com - dodgy copy of wikipedia, see http://www.gpedia.com/nl/gpedia/Hoofdpagina
1474
1475NL:
1476X http://www.martinvrijland.nl - wordpress, autotranslated
1477
1478CA:
1479X https://www.wikiplanet.click (seems like a dodgy copy of wikipedia)
1480X cloudsfeed.com - wordpress admin page
1481
1482
db.getCollection('Webpages').find({$and: [{isMRI: true}, {URL: /indigenousblogs\.com/}]})
=> http://indigenousblogs.com/mi/
1485
1486
TOTAL: Out of all non-NZ/non-AU sites that have "mi" in a webpage's URL path, only 4 contain genuine MRI sentences that aren't automatically translated.
1488
1489
1490TOTALS:
1491US: 26+5 from US with mi in URL path = 31
1492AU: 1
1493DE: 2
1494DK: 2
1495BG: 1
1496CZ: 1
1497ES: 1
1498FR: 1
1499IE: 1
1500TOTAL: 212+5 from US with mi in URL path = 217
1501
1502------------------------------------------------
1503B. NEW ZEALAND SITES: NZ origin + .nz TLD SITES
1504------------------------------------------------
1. Get NZ sites where numPagesContainingMRI > 0
1506
// To list domains in alphabetical order, which addToSet doesn't do, see
// https://stackoverflow.com/questions/21967233/sorting-aggregation-addtoset-result

db.Websites.aggregate([
    {
        $match: {
            $and: [
                {numPagesContainingMRI: {$gt: 0}},
                {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]}
            ]
        }
    },
    { $unwind: "$geoLocationCountryCode" },
    {
        $group: {
            _id: "nz",
            count: { $sum: 1 },
            domain: {$push: "$basicDomain" }, /*domain: { $addToSet: '$domain' },*/
            /*numPagesInMRICount: { $sum: '$numPagesInMRI' },
            numPagesContainingMRICount: { $sum: '$numPagesContainingMRI' }*/
        }
    },
    { $sort : { count : -1} }
]);
1531
165 UNIQUE SITE DOMAINS (NZ).
1533
1534/* 1 */
1535{
1536 "_id" : "nz",
1537 "count" : 182.0,
1538 "domain" : [
1539 "anglicanprayerbook.nz",
1540 "arataua.nz",
1541 "archerpix.com",
1542 "archive.electionresults.govt.nz",
1543 "archive.stats.govt.nz",
1544 "artizani.co.nz",
1545 "auturoa.nz",
1546 "avonside.net",
1547 "biketorqueyamaha.co.nz",
1548 "community.nzdl.org",
1549 "conference.tpwt.maori.nz",
1550 "crimson.co.nz",
1551 "dev.nzpcn.org.nz",
1552 "firstworldwar.tki.org.nz",
1553 "hana.co.nz",
1554 "hangaraumatihiko.tki.org.nz",
1555 "kaiiwicamp.nz",
1556 "kaupare.co.nz",
1557 "kmpmusic.co.nz",
1558 "kuraaiwi.maori.nz",
1559 "kurakokiri.maori.nz",
1560 "kuraproductions.co.nz",
1561 "kurataiao.tki.org.nz",
1562 "maori.livingheritage.org.nz",
1563 "maori.tki.org.nz",
1564 "myfathersworld.net.nz",
1565 "ngarauhuia.ngatiapakiterato.iwi.nz",
1566 "ngatipahauwera.co.nz",
1567 "ngatiporoukiponeke.org.nz",
1568 "ngatiwhakaue.iwi.nz",
1569 "nzpostcard.co.nz",
1570 "oilcrash.com",
1571 "otorohanga.directorybusiness.co.nz",
1572 "philipbeadle.co.nz",
1573 "pukapuka.nz",
1574 "pukekohe.directorybusiness.co.nz",
1575 "pukoro.co.nz",
1576 "punareo.co.nz",
1577 "rakaumanga.school.nz",
1578 "rexedra.gen.nz",
1579 "rsnz.natlib.govt.nz",
1580 "rurued.school.nz",
1581 "satellites.co.nz",
1582 "southerntribes.co.nz",
1583 "cms.sunsmartschools.co.nz",
1584 "talkingtothecan.com",
1585 "teaohou.natlib.govt.nz",
1586 "tehauora.org.nz",
1587 "temahurehure.maori.nz",
1588 "animations.tewhanake.maori.nz",
1589 "tiritiowaitangi.govt.nz",
1590 "tmoa.tki.org.nz",
1591 "w3vietnam.org.nz",
1592 "waiata.maori.nz",
1593 "waitarahistory.org.nz",
1594 "kete.wcl.govt.nz",
1595 "whatonga.school.nz",
1596 "biketorqueyamaha.co.nz",
1597 "brettgraham.co.nz",
1598 "finlaysonpark.school.nz",
1599 "firstworldwar.tki.org.nz",
1600 "gans.co.nz",
1601 "huri-translations.pf",
1602 "jeremybaker.nz",
1603 "kkmmaungarongo.co.nz",
1604 "kmk.maori.nz",
1605 "kura-porirua.school.nz",
1606 "kurakokiri.maori.nz",
1607 "livingheritage.org.nz",
1608 "matarikifestival.org.nz",
1609 "methodist.org.nz",
1610 "ngamanawainc.co.nz",
1611 "nzpcn.org.nz",
1612 "otepoti.school.nz",
1613 "pakanae.maori.nz",
1614 "rakaumanga.school.nz",
1615 "rotoruanz.com",
1616 "runanga.co.nz",
1617 "ruralfind.co.nz",
1618 "tasteofplenty.co.nz",
1619 "teipukarea.maori.nz",
1620 "temarareo.org",
1621 "tereowrap.nz",
1622 "tetaumuturunanga.iwi.nz",
1623 "tewhanake.maori.nz",
1624 "tkkmmokopuna.school.nz",
1625 "tmoa.tki.org.nz",
1626 "topomap.co.nz",
1627 "tuwharetoa.iwi.nz",
1628 "twtop.school.nz",
1629 "w3vietnam.org.nz",
1630 "waiata.maori.nz",
1631 "wcl.govt.nz",
1632 "writersfestival.co.nz",
1633 "zoomin.co.nz",
1634 "2019.nethui.nz",
1635 "28maoribattalion.org.nz",
1636 "admin.teara.govt.nz",
1637 "curriculumtool.education.govt.nz",
1638 "videos.e-agent.nz",
1639 "e-ako-pangarau.nzmaths.co.nz",
1640 "archive.electionresults.govt.nz",
1641 "givealittle.co.nz",
1642 "haereheikaiako.co.nz",
1643 "hepatakakupu.nz",
1644 "holyspirit.nz",
1645 "interactives.stuff.co.nz",
1646 "kaiiwicamp.nz",
1647 "keepourmoneyclean.govt.nz",
1648 "kotahimiriona.co.nz",
1649 "kupengahao.co.nz",
1650 "liveresults.co.nz",
1651 "m.wairarapatv.co.nz",
1652 "manawatuheritage.pncc.govt.nz",
1653 "maoriinvestments.co.nz",
1654 "oag.govt.nz",
1655 "office.e-agent.nz",
1656 "paekupu.co.nz",
1657 "player.vimeo.com",
1658 "rapuatearatika.education.govt.nz",
1659 "register.tpota.org.nz",
1660 "rehuamarae.co.nz",
1661 "reoora.co.nz",
1662 "sexualviolence.victimsinfo.govt.nz",
1663 "sooty.nz",
1664 "teaomaori.news",
1665 "blog.teara.govt.nz",
1666 "cdn.tehiku.nz",
1667 "tetaurawhiri.govt.nz",
1668 "tewikiotereomaori.nz",
1669 "tiritiowaitangi.govt.nz",
1670 "tmmkkm.school.nz",
1671 "ttw1.cwp.govt.nz",
1672 "ashtangatauranga.co.nz",
1673 "blushandbrows.nz",
1674 "components-mart.nz",
1675 "cruisetourstauranga.co.nz",
1676 "cs.waikato.ac.nz",
1677 "dnc.org.nz",
1678 "e-agent.nz",
1679 "electionresults.govt.nz",
1680 "electionresults.org.nz",
1681 "eventcinemas.co.nz",
1682 "hapuhauora.health.nz",
1683 "heartland.co.nz",
1684 "hrc.co.nz",
1685 "infinite-electronic.nz",
1686 "komako.org.nz",
1687 "korokikahukura.co.nz",
1688 "lcds-display.nz",
1689 "maoriinvestments.co.nz",
1690 "maoritelevision.com",
1691 "matarikifestival.org.nz",
1692 "ngamanawainc.co.nz",
1693 "oag.govt.nz",
1694 "pinterest.ca",
1695 "pinterest.co.uk",
1696 "pinterest.fr",
1697 "pinterest.it",
1698 "pinterest.jp",
1699 "pinterest.nz",
1700 "puau.school.nz",
1701 "puhaandpakeha.co.nz",
1702 "rereahu.maori.nz",
1703 "rotorua-rafting.co.nz",
1704 "rotoruanz.com",
1705 "sporty.co.nz",
1706 "stats.govt.nz",
1707 "taitokerautrust.org.nz",
1708 "takitimu.ac.nz",
1709 "tasteofplenty.co.nz",
1710 "tekura.school.nz",
1711 "tematawai.maori.nz",
1712 "terakipaewhenua.school.nz",
1713 "terito.school.nz",
1714 "tetaurawhiri.govt.nz",
1715 "tewikiotereomaori.co.nz",
1716 "tuiatematangi.ac.nz",
1717 "whanau-tahi.school.nz",
1718 "wingspan.co.nz",
1719 "zenbu.co.nz",
1720 "za.pinterest.com"
1721 ],
1722 "numPagesInMRICount" : 4360,
1723 "numPagesContainingMRICount" : 9687
1724}
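
The "domain" array above was collected with $push, so repeated basicDomain values are kept, which is presumably why "count" (182) is larger than the number of unique domains noted earlier (165). A client-side sketch for deduplicating and alphabetically sorting such a list follows. The array is only a short excerpt copied from the output above, not the full result.

```javascript
// Short excerpt of the "domain" array above; $push keeps repeated values.
const domains = [
  "biketorqueyamaha.co.nz",
  "anglicanprayerbook.nz",
  "waiata.maori.nz",
  "biketorqueyamaha.co.nz",
  "waiata.maori.nz",
  "2019.nethui.nz"
];

// A Set drops the repeats; the default sort orders the strings by code unit.
const uniqueSorted = [...new Set(domains)].sort();

console.log(domains.length, uniqueSorted.length);
console.log(uniqueSorted);
```

Run over the full array, the same two lines would give the unique count and an alphabetical listing without changing the aggregation itself.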
1725
1726
NZ sites with pages detected as being in MRI overall (isMRI) are more likely to genuinely contain at least one MRI sentence.
Therefore, to make the manual task of going through all NZ sites a bit easier,
will work with 2 query results that together make up the above:
- those NZ sites where numPagesInMRI > 0
- and the remaining NZ sites that only contain MRI (numPagesInMRI = 0 but numPagesContainingMRI > 0)
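
The split into two result sets can be sketched on plain objects as below. The site records are invented for illustration (only the field names come from the Websites collection), and the sketch assumes that any site with numPagesInMRI > 0 also has numPagesContainingMRI > 0, which is what makes the two subsets add up to the full set.

```javascript
// Hypothetical site records; only the field names are taken from the Websites collection.
const sites = [
  { domain: "http://a.example.nz", numPagesInMRI: 3, numPagesContainingMRI: 5 },
  { domain: "http://b.example.nz", numPagesInMRI: 0, numPagesContainingMRI: 2 },
  { domain: "http://c.example.nz", numPagesInMRI: 1, numPagesContainingMRI: 1 },
  { domain: "http://d.example.nz", numPagesInMRI: 0, numPagesContainingMRI: 0 }
];

// Full set: sites with at least one page containing MRI.
const containing = sites.filter(s => s.numPagesContainingMRI > 0);

// Subset 1: sites with at least one page detected as being in MRI overall.
const inMri = sites.filter(s => s.numPagesInMRI > 0);

// Subset 2: the remainder, i.e. sites that only contain MRI.
const containsOnly = sites.filter(s => s.numPagesInMRI === 0 && s.numPagesContainingMRI > 0);

console.log(containing.length, inMri.length, containsOnly.length);
```

In query terms the second subset would presumably correspond to a $match on {numPagesInMRI: 0, numPagesContainingMRI: {$gt: 0}} combined with the same NZ condition.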
1732
1733----------------------------
1734
2. Get NZ sites where numPagesInMRI > 0
1736
db.Websites.aggregate([
    {
        $match: {
            $and: [
                {numPagesInMRI: {$gt: 0}},
                {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]}
            ]
        }
    },
    { $unwind: "$geoLocationCountryCode" },
    {
        $group: {
            _id: "nz",
            count: { $sum: 1 },
            domain: { $addToSet: '$domain' },
            numPagesInMRICount: { $sum: '$numPagesInMRI' },
            numPagesContainingMRICount: { $sum: '$numPagesContainingMRI' }
        }
    },
    { $sort : { count : -1} }
]);
1758
1759
Annotating the matching domain listing as follows:
* First column: n pages that really are in MRI / n sampled isMRI pages
  To list a site's pages flagged isMRI, from which to sample:
  db.getCollection('Webpages').find({URL: /teipukarea\.maori\.nz/, isMRI: true})
* Second column: n pages that really do contain MRI / n sampled pages that are containsMRI but not isMRI
  To list those pages and check whether they indeed have sentences in MRI:
  db.getCollection('Webpages').find({URL: /maori\.livingheritage\.org\.nz/, isMRI: false, containsMRI: true})
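
Since these per-site lookups are repeated for every domain in the listing, the URL regex can be built from the plain domain string instead of escaping the dots by hand each time. A minimal sketch follows; the helper names escapeRegex and urlFilterFor are made up for illustration and are not part of the original workflow.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: build a URL filter that matches the domain literally.
function urlFilterFor(domain) {
  return new RegExp(escapeRegex(domain));
}

const filter = urlFilterFor("maori.livingheritage.org.nz");

console.log(filter.source);
console.log(filter.test("http://maori.livingheritage.org.nz/some-page"));
console.log(filter.test("http://maoriXlivingheritage.org.nz/"));
```

In the mongo shell the resulting RegExp can be used directly as the query value, e.g. db.getCollection('Webpages').find({URL: urlFilterFor("maori.livingheritage.org.nz"), isMRI: true}).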
1767
1768
1769/* 1 */
1770{
1771 "_id" : "nz",
1772 "count" : 96.0,
1773 "domain" : [
1774 "http://www.teipukarea.maori.nz", 3/3 1/3
1775 "http://ngatipahauwera.co.nz", 2/2, 2/2
1776 "http://www.oag.govt.nz", 2/2 0/2
1777 "https://sexualviolence.victimsinfo.govt.nz", 3/3 0/3
1778 "http://tmoa.tki.org.nz", 3/3 3/3
1779 "http://www.tewhanake.maori.nz", 3/3 2/3
1780 "http://www.matarikifestival.org.nz", 4/4 0/3
1781 "http://www.otepoti.school.nz", 3/3 0/4
1782!! "https://www.maoritelevision.com", 3/4, 0 [no containsMRI outside isMRI pages]
1783 "http://pukapuka.nz", 3/3 1/4 [lorem ipsum used on first 3 pages]
1784 "http://community.nzdl.org", 3/3 0/3 [containsMRI has detected Te Taka Keegan as MRI sentence]
1785X!! "http://kmpmusic.co.nz", 0-4/4? [but CD listing of some MRI album and song titles] 0 [no other pages containsMRI]
1786 "http://maori.livingheritage.org.nz", 2/2 2/2 {includes: http://www.livingheritage.org.nz}
1787 "http://pukoro.co.nz", 2/2 0/2
1788X "https://register.tpota.org.nz", 0/1 [form] 0/2
1789+ "https://cdn.tehiku.nz" => DOMAIN: "tehiku.nz", 0/4, 1/3 [but audio content may be in MRI] But there are pages containing MRI to be found by non-random sampling of tehiku.nz, e.g. https://tehiku.nz/te-hiku-radio/te-tangihanga-ki-a-erima-henare/ contains MRI sentences
1790!! "http://www.runanga.co.nz", 3/3 0 [no containsMRI outside isMRI pages]
1791! "http://kuraaiwi.maori.nz", 2/4 [navigation only downloaded. But site content checked] 2/3
1792 "http://kurataiao.tki.org.nz", 3/3, 1/total 3
1793
!! "http://satellites.co.nz", 3/3 [kpop], 0 [no containsMRI outside isMRI pages]
 "http://teaohou.natlib.govt.nz", 4/4, 2/4
 "http://www.tuwharetoa.iwi.nz", 2/3 0/3
+ "http://auturoa.nz", 0/4 0/3 [lots of MRI terms among English] - COMMUNITY (But there are pages inMRI to be found by non-random sampling, e.g. http://auturoa.nz/KarakiaMoKuaToRangiTeRaa.html)
 "https://www.terito.school.nz", 3/3, 0/2 total
 "https://ttw1.cwp.govt.nz", 3/3 3/3
 "https://www.whanau-tahi.school.nz", 4/4, 1/2 total
 "https://e-ako-pangarau.nzmaths.co.nz", 3/3 total, 1/1 total
 "https://teaomaori.news", 3/3, 0/1 total
 "http://tetaurawhiri.govt.nz", 3/3 3/3 [Māori Language Commission site]
 "https://www.tuiatematangi.ac.nz", 4/4 3/3
 "http://animations.tewhanake.maori.nz", 3/3 3/3
!! "https://www.dnc.org.nz", 1/1 total, 0 [no containsMRI outside isMRI pages]
!! "http://firstworldwar.tki.org.nz", 3/3, 0 [no containsMRI outside isMRI pages]
 "http://www.28maoribattalion.org.nz", 3/3, 1/3
 "http://www.tewikiotereomaori.co.nz", 1/1 total, 3/3
 "http://www.brettgraham.co.nz", 1/1 total, 0/3
!! "https://hepatakakupu.nz", 3/3, 0 [no containsMRI outside isMRI pages]

 "http://anglicanprayerbook.nz", 3/3 3/3
 "http://arataua.nz", 4/4, 2/3
 "http://maori.tki.org.nz", 3/3 3/3
DONE (with/out www): "http://www.firstworldwar.tki.org.nz",
X "http://www.topomap.co.nz", 0/2 [all placenames], 0 [no containsMRI outside isMRI pages]
 "https://paekupu.co.nz", 4/4, 0 [no containsMRI outside isMRI pages]
 "https://haereheikaiako.co.nz", 1/1, 0 [no containsMRI outside isMRI pages]
 "https://curriculumtool.education.govt.nz", 4/4, 3/3
 "http://kurakokiri.maori.nz", 3/3, 3/3 [same nav menus on each page] {includes: "http://www.kurakokiri.maori.nz"}
 "http://www.kkmmaungarongo.co.nz", 3/3, 3/3
 "http://www.heartland.co.nz", 3/3, 1/1 total
 "http://oilcrash.com", 2/2 total, 0/3
 "http://www.kura-porirua.school.nz", 4/4, 2/3
 "https://www.sporty.co.nz", 3/3, 0 [no containsMRI outside isMRI pages]
 "https://www.tematawai.maori.nz", 3/3, 3/3

+ "https://www.terakipaewhenua.school.nz",
+ "http://www.tetaurawhiri.govt.nz",
+ "http://archive.stats.govt.nz", (1 page isMRI)
+ "http://tiritiowaitangi.govt.nz",
+!! "http://www.waiata.maori.nz", {includes: "http://waiata.maori.nz"}
+ "http://hana.co.nz", [crawled version of page contains MRI sentences, but current page is labelled for mobiles whereas in browser it just has a picture]
+ "http://kaupare.co.nz",
+ "http://www.tereowrap.nz",
?X "https://www.e-agent.nz", [autotranslated? SEO related site in NZ, Chinese, English and MRI] {includes: "https://office.e-agent.nz"}
 { included: "http://videos.e-agent.nz", [AT: e-agent.nz] 3/3, 3/3 [repeated nav] }
+ "http://www.hrc.co.nz",
+ "http://ngatiporoukiponeke.org.nz",

+ "http://rurued.school.nz",
+ "http://www.twtop.school.nz",
X "https://www.infinite-electronic.nz", [autotranslated product site]
+!! "http://www.huri-translations.pf",
+ "https://admin.teara.govt.nz", e.g. https://admin.teara.govt.nz/mi/biographies/4m56/moko-pita-te-turuki-tamati {included: "http://blog.teara.govt.nz", 3/3, 0/3 [AS: teara.govt.nz, e.g. https://teara.govt.nz/mi/biographies/1t28/te-hapuku/media]}
+!! "https://tiritiowaitangi.govt.nz",
+ "http://www.tmoa.tki.org.nz",
+ "https://www.komako.org.nz", [no longer available, but crawled page had a paragraph in Maori with presumably a translation thereafter]
+ "http://www.wcl.govt.nz", {included: "http://kete.wcl.govt.nz" as wcl.govt.nz; 2/5 [first 3 misdetected: Tokelauan (American Samoa), Kiribati, Tongan], 0/3}
+!! "http://punareo.co.nz", [waiata]

+ "https://rapuatearatika.education.govt.nz",
+ "http://tmmkkm.school.nz",
X "https://www.components-mart.nz", [autotranslated product site]
+ "http://www.cs.waikato.ac.nz", [Te Taka's pages!]
+!!! "http://www.kupengahao.co.nz", [MRI language books and resources]
+ "https://www.hapuhauora.health.nz", [Smokefree site. The one isMRI page has a proper sentence.]
X "https://www.lcds-display.nz", [autotranslated product site]
+ "http://cms.sunsmartschools.co.nz", [as sunsmartschools.co.nz, e.g. http://sunsmartschools.co.nz/kura/how-to-become.html]
+ "http://kuraproductions.co.nz",
+ "https://keepourmoneyclean.govt.nz", [1 page]

+!! "http://www.tekura.school.nz",
+ "http://www.tkkmmokopuna.school.nz", [e.g. http://www.tkkmmokopuna.school.nz/newsletter_sets/newsletters/13-wahanga-4-wiki-5-he-puna-korero]
+ "http://hangaraumatihiko.tki.org.nz", [govt website. e.g. http://hangaraumatihiko.tki.org.nz/te-tupuranga-tangata-me-te-rorohiko/whakatupuranga-3-te-whakatau-tikanga-kawe-i-nga-mahi-whakaoti-hopanga/]
+ "http://www.pakanae.maori.nz"
 ],
 "numPagesInMRICount" : 4360,
 "numPagesContainingMRICount" : 7968
}


96 sites were detected as having isMRI pages. Of these, 7 are sites/subdomains already counted under another entry among the 96, leaving 89 unique sites.

From these 89, subtract 2.5* product sites and 2 non-MRI sites (song listings, web forms etc.)
 * the 0.5 is for the e-agent.nz site
= 84.5 sites in total that at least contain MRI; most have pages inMRI.

We are excluding the one marked with ?X, as it appears to be autotranslated.
In this set, then, 84 sites at least contain MRI out of 89 unique sites detected as containing pages inMRI.

If not counting unique sites, but counting the subdomains in the MongoDB query result separately: 84 + 4 sites (non-unique or split over subdomains) in the result set contained MRI = 88 sites.
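
The tally above can be re-checked with a few lines of arithmetic. This is only a sketch in plain JavaScript: the variable names are made up for illustration, and the figures are the ones quoted in the notes above.

```javascript
// Figures quoted in the notes above; all variable names are illustrative only.
const detectedEntries = 96;   // entries in the query result
const duplicateEntries = 7;   // sites/subdomains already counted under another entry
const productSites = 2.5;     // autotranslated product sites (the 0.5 is e-agent.nz)
const nonMriSites = 2;        // song listings, web forms etc.
const extraSubdomains = 4;    // non-unique entries that also contained MRI

const uniqueDetected = detectedEntries - duplicateEntries;          // unique sites
const containingMri = uniqueDetected - productSites - nonMriSites;  // still includes half of e-agent.nz
const containingMriWhole = Math.floor(containingMri);               // once the ?X site is excluded
const countingSubdomains = containingMriWhole + extraSubdomains;    // subdomains counted separately

console.log({ uniqueDetected, containingMri, containingMriWhole, countingSubdomains });
```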

----------------------------

3. Handling the remainder: NZ sites where numPagesInMRI = 0 BUT numPagesContainingMRI > 0

The remainder = 80 NZ sites detected as containing no pages inMRI, but with a positive number of pages detected as containsMRI:

db.Websites.aggregate([
    {
        // NZ sites (by geolocation or by .nz in the domain) that have pages
        // containing MRI but no pages detected as being in MRI
        $match: {
            $and: [
                {numPagesContainingMRI: {$gt: 0}},
                {numPagesInMRI: {$eq: 0}},
                // note: /\.nz/ is unanchored, so it matches ".nz" anywhere in the domain
                {$or: [{geoLocationCountryCode: "NZ"}, {domain: /\.nz/}]}
            ]
        }
    },
    { $unwind: "$geoLocationCountryCode" },
    {
        // collapse all matching sites into a single summary document
        $group: {
            _id: "nz",
            count: { $sum: 1 },
            domain: { $addToSet: '$domain' },
            numPagesInMRICount: { $sum: '$numPagesInMRI' },
            numPagesContainingMRICount: { $sum: '$numPagesContainingMRI' }
        }
    },
    { $sort : { count : -1} }
]);
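
To make clearer what the pipeline above computes, here is a rough in-memory equivalent in plain JavaScript. It is only a sketch: the four sample documents are invented for illustration (they are not from the Websites collection), and only the field names are taken from the query.

```javascript
// Hypothetical sample documents; field names follow the query above,
// the domains and counts are invented for illustration.
const websites = [
  { domain: "http://example-a.co.nz", geoLocationCountryCode: "NZ", numPagesInMRI: 0, numPagesContainingMRI: 3 },
  { domain: "http://example-b.nz",    geoLocationCountryCode: "US", numPagesInMRI: 0, numPagesContainingMRI: 1 },
  { domain: "http://example-c.com",   geoLocationCountryCode: "NZ", numPagesInMRI: 2, numPagesContainingMRI: 5 },
  { domain: "http://example-d.com",   geoLocationCountryCode: "US", numPagesInMRI: 0, numPagesContainingMRI: 4 },
];

// Rough equivalent of the $match stage.
const matched = websites.filter(site =>
  site.numPagesContainingMRI > 0 &&
  site.numPagesInMRI === 0 &&
  (site.geoLocationCountryCode === "NZ" || /\.nz/.test(site.domain))
);

// Rough equivalent of the $group stage with the constant _id "nz".
const summary = {
  _id: "nz",
  count: matched.length,
  domain: [...new Set(matched.map(site => site.domain))],
  numPagesInMRICount: matched.reduce((sum, site) => sum + site.numPagesInMRI, 0),
  numPagesContainingMRICount: matched.reduce((sum, site) => sum + site.numPagesContainingMRI, 0),
};

console.log(summary);
```

With these sample documents, only the first two pass the filter: the third has pages inMRI, and the fourth is neither of NZ geolocation nor a .nz domain.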


Find pages for testing with:
 db.getCollection('Webpages').find({URL:/ashtangatauranga\.co\.nz/, containsMRI: true, mriSentenceCount: {$gt: 0}})


/* 1 */
{
 "_id" : "nz",
 "count" : 80.0,
 "domain" : [
X "http://www.zoomin.co.nz", [map site, so placenames]
X "http://www.biketorqueyamaha.co.nz", [placenames] {includes "http://biketorqueyamaha.co.nz"}
X "http://archerpix.com", [photo captions containing placenames]
X "http://philipbeadle.co.nz", [art captions containing placenames]
X "https://2019.nethui.nz", [Just MRI words in ENG sentences]
X "http://crimson.co.nz", [address]
+ "http://holyspirit.nz", (e.g. https://holyspirit.nz/wp-content/uploads/02-25-2018-Newsletter.pdf)
X "https://www.wingspan.co.nz", [1 page, 1 conjoined word, looks like placename]
X "http://nzpostcard.co.nz", [postcards with placenames]
+ "https://www.ngamanawainc.co.nz", (e.g. https://www.ngamanawainc.co.nz/history/id/60) {includes "http://www.ngamanawainc.co.nz"}

+ "http://www.finlaysonpark.school.nz", [e.g. http://www.finlaysonpark.school.nz/58/pages/61-team-overview; Some Tongan language pages and Samoan pages]
X "http://artizani.co.nz", [address]
+ "http://www.w3vietnam.org.nz", [e.g. http://www.w3vietnam.org.nz/w3%20mihi%20ki%20ka%20hoia.htm] (includes "http://w3vietnam.org.nz")
X "https://sooty.nz", [names, war death notices, place names]
X? "http://rakaumanga.school.nz", [School has a longer Maori title name, but that's it] {includes "http://www.rakaumanga.school.nz"}
X "http://www.rotoruanz.com", [e.g. https://www.rotoruanz.com/RNZ/media/Media-Library/SOI-Board-version-08-June-2018-final-published-on-website.pdf]
X "https://www.cruisetourstauranga.co.nz", [English with Tauranga being MRI placename]
X "http://www.jeremybaker.nz", [one word, HOkio]

X "https://liveresults.co.nz", [canoe sports team names]
X "http://rexedra.gen.nz", [ENG sentence with MRI words]
+ "https://www.takitimu.ac.nz", [school of performing arts, example MRI sentence at https://www.takitimu.ac.nz/about-us]
X "http://www.electionresults.govt.nz", [placenames and misdetection of Return to Home Page] {includes https://www.electionresults.org.nz which seems to be an alias for the same} {includes "http://archive.electionresults.govt.nz"}
+ "https://kotahimiriona.co.nz", (e.g. https://kotahimiriona.co.nz/about-the-app/)
+ "https://rehuamarae.co.nz", (e.g. https://rehuamarae.co.nz/)
+ "http://reoora.co.nz", (e.g. https://reoora.co.nz/about/)

X "http://otorohanga.directorybusiness.co.nz", [placenames]
X "http://waitarahistory.org.nz", [placenames and misdetection of sentences of the form "31, No 262." as MRI]
+ "https://manawatuheritage.pncc.govt.nz", (e.g. https://manawatuheritage.pncc.govt.nz/about)
+ "http://rsnz.natlib.govt.nz", [e.g. http://rsnz.natlib.govt.nz/volume/rsnz_26/rsnz_26_00_004560.html] NOTE: site appears to use Greenstone
X "https://www.rotorua-rafting.co.nz", [placenames]
+ "https://www.taitokerautrust.org.nz", (e.g. https://www.taitokerautrust.org.nz/)
+ "http://tewikiotereomaori.nz", (e.g. https://tewikiotereomaori.nz/about/)
+ "https://www.korokikahukura.co.nz", (e.g. https://www.korokikahukura.co.nz/ko-wai-m257tou.html about the connection to the Waikato River)

X "https://www.puhaandpakeha.co.nz", [ENG sentences with MRI words]
X "http://myfathersworld.net.nz", [placenames]
X "https://www.ashtangatauranga.co.nz", [misdetection]
+ "https://www.pinterest.nz", (e.g. sentence "E tu Whare Rangi, e tu Whare Pataka, kia ora koutou nga Whare Tipuna o te Ao Maori." at https://www.pinterest.nz/amp/marylizw03/maori/)
+ "https://www.rereahu.maori.nz", (e.g. https://www.rereahu.maori.nz/uploads/72717/files/168860/Te_Maru_o_Rereahu_Newsletter_No_5_February_2011.pdf)
+ "http://givealittle.co.nz", (e.g. https://givealittle.co.nz/cause/rebuild-tapu-te-ranga-marae which contains greetings phrases and sentence "Nā te ringa tangata i hanga te whare Nā te tuarā o te whare i whakatipu i te tangata")
X "http://www.gans.co.nz", [placenames]
+ "https://kaiiwicamp.nz", [placenames] {includes "http://kaiiwicamp.nz"}
+ "http://ngarauhuia.ngatiapakiterato.iwi.nz", (e.g. http://ngarauhuia.ngatiapakiterato.iwi.nz/pdf/Nga-Rau-Huia_Raumati_2015_Issue1.pdf)
+ "https://m.wairarapatv.co.nz", (e.g. https://m.wairarapatv.co.nz/archive/i/JGmJqCZCNK8/te-ara-whanui-kura-kaupapa-maori-o-nga-khanga-reo-o-te-awa-kairangi)

X "http://www.methodist.org.nz", [ENG sentence with MRI words]
+ "http://avonside.net", (e.g. http://avonside.net/Nicky/indexmain.htm)
X "http://www.ruralfind.co.nz", [placenames]
+ "http://www.maoriinvestments.co.nz", (e.g. Vision, Mission & Values tab of https://maoriinvestments.co.nz/about/organisation)
+ "http://conference.tpwt.maori.nz", (e.g. https://www.tpwt.maori.nz/te-mahe-matauranga/)
+ "https://www.puau.school.nz", (e.g. https://www.puau.school.nz/k%C4%81inga-home)
+? "http://ngatiwhakaue.iwi.nz", (e.g. greetings at http://ngatiwhakaue.iwi.nz/)
X "http://www.nzpcn.org.nz", [lots of misdetections and one sentence in English with MRI words, e.g. "Kaori is a Taonga to Maori".] {includes "http://dev.nzpcn.org.nz"}
+? "https://interactives.stuff.co.nz", [2 x page mostly empty except for the title "TE WIKI O TE REO MĀORI Māori"]
+ "http://tehauora.org.nz", (e.g. http://tehauora.org.nz/ and proverb at bottom of http://tehauora.org.nz/about-us)

+ "http://temahurehure.maori.nz", (e.g. greeting http://temahurehure.maori.nz/site/file/temp/hirebook.pdf)
X "http://pukekohe.directorybusiness.co.nz", [placenames]
+!! "http://www.temarareo.org", (dictionary with sample sentences. e.g. http://www.temarareo.org/PAPAKUPU/dictionary-searchresults/A-2.htm)
X "https://www.tasteofplenty.co.nz", [Tauranga placename and misdetection of "No one went away hungry"] {includes "http://www.tasteofplenty.co.nz"}

+ "http://www.tetaumuturunanga.iwi.nz", (e.g. http://www.tetaumuturunanga.iwi.nz/wp-content/uploads/2016/04/Malvern-Cultural-Narrative-Draft.pdf)


X "https://www.blushandbrows.nz", [misdetection of "Makeup..."]
X "http://talkingtothecan.com", [misdetection of things like "22, no." and mistaking ENG sentences with MRI words]

+? "http://whatonga.school.nz", [school title]
+? "https://player.vimeo.com", [Video titles contain MRI sentences and MRI titles interspersed in ENG sentence. Video's audio content may be in MRI]
+ "http://www.writersfestival.co.nz", (e.g. http://www.writersfestival.co.nz/news/Page1/urgently-relevant-novel-wins-countrys-richest-literary-award/)
+? "http://southerntribes.co.nz", [brazilian jiu jitsu site with a greeting in MRI on main page]
+ "http://www.kmk.maori.nz", (e.g. http://www.kmk.maori.nz/kmk-events)
+ "https://www.stats.govt.nz", (e.g. http://archive.stats.govt.nz/Census/2001-census-data/2001-census-pacific-profiles/cook-island-maori-people-in-new-zealand.aspx)
X "http://www.eventcinemas.co.nz", [placenames and misdetection of ENG "Atura Hotels"]
X "https://www.zenbu.co.nz", [misdetection and NZ school addresses]
 ],
 "numPagesInMRICount" : 0,
 "numPagesContainingMRICount" : 1673
}

80 sites were detected as having 0 pages inMRI but >0 pages that containMRI.

[Of these, 9 entries belong to the same site/subdomain as another entry => 71 unique sites.
Of those, only 35 have at least one sentence in Maori and are marked with +. (Those marked with +? just have Maori titles or greetings, or nothing more than a sentence.)
So in this set, a further 35 sites contain MRI out of 71 unique sites detected as having pages containingMRI but no pages inMRI.
Total sites: 35/71
Total for NZ: (84+35)/(89+71) = 119/160 unique NZ sites have at least one webpage containing at least one sentence inMRI.
]
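
Part of the "unique sites" grouping above could be approximated mechanically by ignoring the protocol and a leading "www.". The sketch below is only an assumption about how one might do that. It is not how the counts above were produced, since the manual grouping also merged some subdomains by hand (e.g. dev.nzpcn.org.nz into nzpcn.org.nz). The helper name siteKey is made up for illustration, and the URLs are taken from the list above.

```javascript
// Hypothetical helper: reduce a site URL to a comparison key by dropping
// the protocol and a leading "www." so that variants of one site compare equal.
function siteKey(url) {
  return new URL(url).hostname.replace(/^www\./, "");
}

// A few entries from the result set above, including two duplicated sites.
const listedEntries = [
  "https://kaiiwicamp.nz",
  "http://kaiiwicamp.nz",
  "http://www.w3vietnam.org.nz",
  "http://w3vietnam.org.nz",
  "https://www.takitimu.ac.nz",
];

const uniqueKeys = new Set(listedEntries.map(siteKey));
console.log(uniqueKeys.size, [...uniqueKeys]);
```

Here the five listed entries collapse to three keys.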

TOTAL:
If counting subdomains and duplicated sites distinctly, then 35 + an additional 3 sites, making it 38/80 sites in this set.

This makes (88+38)/(96+80) = 126/176 NZ sites (counting distinct subdomains and duplicated sites) that contain at least one web page with at least 1 sentence in MRI.
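
The two NZ totals can be put side by side as a worked calculation. This is a sketch using only the figures quoted above; the object and function names are made up for illustration.

```javascript
// NZ totals under the two counting schemes described above.
const uniqueSiteTotals = { actual: 84 + 35, detected: 89 + 71 };    // unique sites
const distinctEntryTotals = { actual: 88 + 38, detected: 96 + 80 }; // subdomains/duplicates counted distinctly

// Share of detected sites that manual inspection confirmed, as a percentage string.
const confirmedShare = ({ actual, detected }) =>
  (100 * actual / detected).toFixed(1) + "%";

console.log("unique sites:", uniqueSiteTotals, confirmedShare(uniqueSiteTotals));
console.log("distinct entries:", distinctEntryTotals, confirmedShare(distinctEntryTotals));
```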




3. GRAND TOTALS

Count per country of web SITES that contain at least 1 web page containing at least 1 genuine MRI sentence. (For overseas countries, the number in brackets is the number of sites of that geolocation if nz TLDs were NOT grouped with NZ geolocation under "NZ". For NZ, the number in brackets is the number of sites that are of NZ geolocation only, ignoring nz TLDs hosted overseas. Bracketed numbers are given only where they differ from the count of sites by geolocation, which is the number outside the brackets.)

OLD
countryCode, num manually inspected sites as having pages containing MRI, num sites openNLP detected as having pages containing MRI
NZ: 126 actual sites out of 176 (89) detected sites
US: 29 actual out of 422 (486) detected sites
AU: 2 actual out of 5 (21) detected sites
DE, Germany: 2 actual out of 27 detected sites
DK, Denmark: 2 out of 8
BG, Bulgaria: 1 out of 1
CZ, Czech Republic: 1 out of 4
ES, Spain: 1 out of 5 (7)
FR, France: 1 out of 35 (36)
IE, Ireland: 1 out of 2

NEW - Adjusted grand totals above with changes to values after reingesting into MongoDB (the adjusted values are from section C below). The numbers in brackets here are the UNIQUE domain names/sites that OpenNLP detected as having pages containing MRI, where different.

countryCode, num manually inspected sites as having pages containing MRI, num sites OpenNLP detected as having pages containing MRI
NZ: 124 (113 + 11 non-unique) actual sites out of 176 (159) detected sites
US: 32 actual out of 422 (405) detected sites
AU: 1 actual out of 5 detected sites
DE, Germany: 2 actual out of 26 (24) detected sites
DK, Denmark: 2 out of 8
BG, Bulgaria: 1 out of 1
CZ, Czech Republic: 1 out of 5 (4)
ES, Spain: 1 out of 5
FR, France: 1 out of 35 (34)
IE, Ireland: 1 out of 2
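
One way to read the NEW table is as a per-country confirmation rate: the share of detected sites that manual inspection confirmed as containing MRI. The sketch below computes that from the rows as listed, using the non-bracketed "detected" figures. The field names and the choice of metric are illustrative assumptions, not part of the original analysis.

```javascript
// Rows of the NEW table above (non-bracketed figures); field names are illustrative.
const newTable = [
  { country: "NZ", actual: 124, detected: 176 },
  { country: "US", actual: 32,  detected: 422 },
  { country: "AU", actual: 1,   detected: 5 },
  { country: "DE", actual: 2,   detected: 26 },
  { country: "DK", actual: 2,   detected: 8 },
  { country: "BG", actual: 1,   detected: 1 },
  { country: "CZ", actual: 1,   detected: 5 },
  { country: "ES", actual: 1,   detected: 5 },
  { country: "FR", actual: 1,   detected: 35 },
  { country: "IE", actual: 1,   detected: 2 },
];

// Add the confirmed share per country and list the highest shares first.
const byConfirmedShare = newTable
  .map(row => ({ ...row, share: row.actual / row.detected }))
  .sort((a, b) => b.share - a.share);

for (const row of byConfirmedShare) {
  console.log(`${row.country}: ${row.actual}/${row.detected} = ${(100 * row.share).toFixed(1)}%`);
}
```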

TOTAL: 167 sites of all the crawled sites where the crawled set of pages per site actually contained at least one sentence in Māori based on manual inspection.
Out of a total of 221+471+176 = 869 sites that were detected with numPagesContainingMRI > 0 (868 sites containing at least one page with at least one sentence detected in MRI)

========================================
In the 2nd table (immediately above), I've adjusted grand totals with the following.

----------------------------------------------------------------------
C. GEOLOCATION CHANGES AFTER REINGESTING UPON INTRODUCING ANGLICAN.ORG:
----------------------------------------------------------------------
NZ the same as before
 NL, DE, FR, DK, ES, GB same
 IT, AT, RO, CH, RU, BG, MX, JP, CN, IE, IR, FI same

US gained 3:
+ anglican.org (NEW)
X articles.imperialtometric.com (from CA)
X daandehn.com (from CA)

CA lost 2:
X articles.imperialtometric.com (to US)
X daandehn.com (to US)

AU:
+ ! lost kiwiproperty.com (to US - mi in URL path version file!)


CZ:
X gained viveipcl.com (from UNKNOWN)

UNKNOWN:
X gained hitiaotera.com (from IL)

IL:
X lost one (hitiaotera.com to UNKNOWN)

-----------------
FINAL COUNT OF unique SITES (that contain >= 1 page with >= 1 MRI sentence)
-----------------

DK (2):
http://ngapuhiradio.com
http://ngapuhitelevision.com
 [http://akona.ngapuhitelevision.com
 http://waiatarangatiratanga.ngapuhitelevision.com
 http://jazz.ngapuhitelevision.com
 http://powhiri.ngapuhitelevision.com
 http://komisch.ngapuhitelevision.com]

DE (2)
http://www.udhr.de
https://www.cartogiraffe.com

AU (1)
https://koreromaori.com

FR (1)
http://chantsdeluttes.free.fr

ES (1)
https://www.uv.es

IE (1)
https://coggle.it

CZ: (1)
http://www.henryklahola.nazory.cz

BG: (1)
http://anitra.net

US finals 31 (33):
http://anglican.org
http://anglicanhistory.org
http://www.unicode.org
https://static-promote.weebly.com
http://aclhokiangarocks.blogspot.com
http://bahaiprayers.net
https://biblehub.com
http://www.muhammad.com
http://www.godrules.net
http://m.biblepub.com
http://www.krassotkin.ru
http://www.gotquestions.org
https://maorinews.com
http://maaori.com
http://kiaorahola.blogspot.com
https://kjohnsonnz.blogspot.com
http://pumanawawhangara.blogspot.com
http://dannykahei.tripod.com
http://burkekm001.tripod.com
http://tkkpipipaopao.blogspot.com
http://manateina.blogspot.com
http://tatai09.blogspot.com
http://www.twttoa.com
http://tuhua2010.blogspot.com
http://piripi.blogspot.com
https://drive.google.com
https://in.pinterest.com
+? https://www.breaker.audio [AUDIO]
+X http://ritusehji.blogspot.com
27 (28)

https://www.kiwiproperty.com
http://indigenousblogs.com
https://mi.m.wikipedia.org
https://mi.wikipedia.org **
http://csunplugged.org [includes https://www.csunplugged.org]
?~ https://policies.oclc.org

+ 4 (5) = 31 (33), including sites with mi in the URL path
** Listed separately because the subdomain prefixes don't match: querying MongoDB for matches on /mi.wikipedia.org/ won't return results for /mi.m.wikipedia.org/, and vice versa.
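
The point about the two Wikipedia prefixes can be checked with plain JavaScript regular expressions, which behave the same way as the regex literals used in the mongo shell queries in these notes. The combined pattern at the end is only a suggestion, not something used in the original analysis; the dots are escaped so that they match a literal "." only.

```javascript
const wikipediaSites = ["https://mi.wikipedia.org", "https://mi.m.wikipedia.org"];

const desktopPattern = /mi\.wikipedia\.org/;       // matches only the desktop host
const mobilePattern = /mi\.m\.wikipedia\.org/;     // matches only the mobile host
const eitherPattern = /mi\.(m\.)?wikipedia\.org/;  // suggested: optional "m." subdomain

for (const site of wikipediaSites) {
  console.log(site, {
    desktop: desktopPattern.test(site),
    mobile: mobilePattern.test(site),
    either: eitherPattern.test(site),
  });
}
```

A pattern like the last one could be passed as the URL filter in a find() on the Webpages collection to retrieve both sets of pages in one query.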


NZ: 113 unique + 11 non-unique
http://www.teipukarea.maori.nz
http://ngatipahauwera.co.nz
http://www.oag.govt.nz
https://sexualviolence.victimsinfo.govt.nz
http://tmoa.tki.org.nz
http://www.tewhanake.maori.nz
http://www.matarikifestival.org.nz
http://www.otepoti.school.nz
https://www.maoritelevision.com
http://pukapuka.nz
http://community.nzdl.org
http://maori.livingheritage.org.nz [http://www.livingheritage.org.nz]
http://pukoro.co.nz
https://cdn.tehiku.nz [DOMAIN: tehiku.nz]
http://www.runanga.co.nz
http://kuraaiwi.maori.nz
http://kurataiao.tki.org.nz
http://satellites.co.nz
http://teaohou.natlib.govt.nz
http://www.tuwharetoa.iwi.nz
https://www.terito.school.nz
https://ttw1.cwp.govt.nz
https://www.whanau-tahi.school.nz
https://e-ako-pangarau.nzmaths.co.nz
https://teaomaori.news
http://tetaurawhiri.govt.nz
https://www.tuiatematangi.ac.nz
http://animations.tewhanake.maori.nz
https://www.dnc.org.nz
http://firstworldwar.tki.org.nz [http://www.firstworldwar.tki.org.nz]
http://www.28maoribattalion.org.nz
http://www.tewikiotereomaori.co.nz
http://www.brettgraham.co.nz
https://hepatakakupu.nz
http://anglicanprayerbook.nz
http://arataua.nz
http://maori.tki.org.nz
https://paekupu.co.nz
https://haereheikaiako.co.nz
https://curriculumtool.education.govt.nz
http://kurakokiri.maori.nz [includes: http://www.kurakokiri.maori.nz]
http://www.kkmmaungarongo.co.nz
http://www.heartland.co.nz
http://oilcrash.com
http://www.kura-porirua.school.nz
https://www.sporty.co.nz
https://www.tematawai.maori.nz
https://www.terakipaewhenua.school.nz
http://www.tetaurawhiri.govt.nz
http://archive.stats.govt.nz
http://tiritiowaitangi.govt.nz
http://www.waiata.maori.nz [includes: http://waiata.maori.nz]
http://hana.co.nz
http://kaupare.co.nz
http://www.tereowrap.nz
http://www.hrc.co.nz
http://ngatiporoukiponeke.org.nz
http://rurued.school.nz
http://www.twtop.school.nz
http://www.huri-translations.pf
https://teara.govt.nz [https://admin.teara.govt.nz, http://blog.teara.govt.nz]
https://tiritiowaitangi.govt.nz
http://www.tmoa.tki.org.nz
https://www.komako.org.nz
http://www.wcl.govt.nz [included:http://kete.wcl.govt.nz]
http://punareo.co.nz
https://rapuatearatika.education.govt.nz
http://tmmkkm.school.nz
http://www.cs.waikato.ac.nz
http://www.kupengahao.co.nz
https://www.hapuhauora.health.nz
http://cms.sunsmartschools.co.nz [http://sunsmartschools.co.nz/]
http://kuraproductions.co.nz
https://keepourmoneyclean.govt.nz
http://www.tekura.school.nz
http://www.tkkmmokopuna.school.nz
http://hangaraumatihiko.tki.org.nz
http://www.pakanae.maori.nz
--- 78+9
http://holyspirit.nz
https://www.ngamanawainc.co.nz [includes http://www.ngamanawainc.co.nz]
http://www.finlaysonpark.school.nz
http://www.w3vietnam.org.nz [includes http://w3vietnam.org.nz]
https://www.takitimu.ac.nz
https://kotahimiriona.co.nz
https://rehuamarae.co.nz
http://reoora.co.nz
https://manawatuheritage.pncc.govt.nz
http://rsnz.natlib.govt.nz
https://www.taitokerautrust.org.nz
http://tewikiotereomaori.nz
https://www.korokikahukura.co.nz
https://www.pinterest.nz
https://www.rereahu.maori.nz
http://givealittle.co.nz
https://kaiiwicamp.nz [includes http://kaiiwicamp.nz]
http://ngarauhuia.ngatiapakiterato.iwi.nz
https://m.wairarapatv.co.nz
http://avonside.net
http://www.maoriinvestments.co.nz
http://conference.tpwt.maori.nz
https://www.puau.school.nz
http://tehauora.org.nz
http://temahurehure.maori.nz
http://www.temarareo.org
http://www.tetaumuturunanga.iwi.nz
http://www.writersfestival.co.nz
http://www.kmk.maori.nz
https://www.stats.govt.nz [includes http://archive.stats.govt.nz]
--- 30+4
+? http://ngatiwhakaue.iwi.nz
+? https://interactives.stuff.co.nz
+? http://whatonga.school.nz
+? https://player.vimeo.com
+? http://southerntribes.co.nz
--- 78+30+(5) = 113 unique + 11 non-unique
?X https://www.e-agent.nz [includes: https://office.e-agent.nz,http://videos.e-agent.nz]