source: other-projects/maori-lang-detection/MoreReading/mongodb.txt@ 33914

Last change on this file since 33914 was 33914, checked in by ak19, 4 years ago

Shortlisted just the domain sites by country into ManualShortlist2.txt after taking the reingest into MongoDB into account. And then put all these shortlisted domains for which containsMRI=true as per manual inspection into a separate new file.

MongoDB
Installation:
    https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/
    https://docs.mongodb.com/manual/administration/install-on-linux/
    https://hevodata.com/blog/install-mongodb-on-ubuntu/
    https://www.digitalocean.com/community/tutorials/how-to-install-mongodb-on-ubuntu-16-04
    CENTOS (Analytics): https://tecadmin.net/install-mongodb-on-centos/
    FROM SOURCE: https://github.com/mongodb/mongo/wiki/Build-Mongodb-From-Source
GUI:
    https://robomongo.org/
    Robomongo is now called Robo 3T

https://www.tutorialspoint.com/mongodb/mongodb_java.htm
JAR FILE:
    http://central.maven.org/maven2/org/mongodb/mongo-java-driver/
    https://mongodb.github.io/mongo-java-driver/


https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/
http://www.programmersought.com/article/6500308940/

    sudo apt-get install mongodb-clients
    mongo 'mongodb://mongodb.cms.waikato.ac.nz:27017' -u anupama -p

Failed with:
    Error: HostAndPort: host is empty at src/mongo/shell/mongo.js:148
    exception: connect failed

This is due to a version incompatibility between the mongo client and the MongoDB server.
The solution is to follow the instructions at http://www.programmersought.com/article/6500308940/
and then https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/
as below:

    sudo apt-get purge mongodb-clients
    sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 9DA31620334BD75D9DCB49F368818C72E52529D4
    echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu xenial/mongodb-org/4.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.0.list
    sudo apt-get update
    sudo apt-get install mongodb-clients
    mongo 'mongodb://mongodb.cms.waikato.ac.nz:27017' -u anupama -p
(still doesn't work)
    sudo apt-get install -y mongodb-org
The above ensures an up-to-date mongo client, but installs the MongoDB server too. Maybe this is the only step needed to install both an up-to-date mongo client and the MongoDB server?
    sudo service mongod status

    sudo service mongod start
"mongod" stands for mongo daemon. It runs the MongoDB server, which listens for client connections.
    sudo service mongod status
    sudo service mongod stop


DETAILS:

wharariki:[879]/Scratch/ak19/gs3-extensions/maori-lang-detection>mongo 'mongodb://mongodb.cms.waikato.ac.nz:27017' -u anupama -p

This didn't work with the password. It failed with:
57
58 MongoDB shell version: 2.6.10
59 Enter password:
60 connecting to: mongodb://mongodb.cms.waikato.ac.nz:27017
61 2019-11-04T20:02:47.970+1300 Assertion: 13110:HostAndPort: host is empty
62 2019-11-04T20:02:47.970+1300 0x6b75c9 0x659e9f 0x636f69 0x4fa55c 0x501249 0x4fa7f1 0x6006fd 0x5eb869 0x7f7bfbd47d76 0x1f3c10d06362
63 mongo(_ZN5mongo15printStackTraceERSo+0x39) [0x6b75c9]
64 mongo(_ZN5mongo10logContextEPKc+0x21f) [0x659e9f]
65 mongo(_ZN5mongo11msgassertedEiPKc+0xd9) [0x636f69]
66 mongo(_ZN5mongo16ConnectionString12_fillServersENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x50c) [0x4fa55c]
67 mongo(_ZN5mongo16ConnectionStringC1ENS0_14ConnectionTypeERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES9_+0x99) [0x501249]
68 mongo(_ZN5mongo16ConnectionString5parseERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERS6_+0x201) [0x4fa7f1]
69 mongo(_ZN5mongo17mongoConsExternalEPNS_7V8ScopeERKN2v89ArgumentsE+0x11d) [0x6006fd]
70 mongo(_ZN5mongo7V8Scope10v8CallbackERKN2v89ArgumentsE+0xa9) [0x5eb869]
71 /usr/lib/libv8.so.3.14.5(+0x99d76) [0x7f7bfbd47d76]
72 [0x1f3c10d06362]
73 2019-11-04T20:02:47.971+1300 Error: HostAndPort: host is empty at src/mongo/shell/mongo.js:148
74 exception: connect failed
75
76
This is due to a version incompatibility between the mongo client and the MongoDB server.
The client version can be found in the output above (2.6.10).
The server version can be found from within the mongo client shell. To start the shell without connecting to a db:
80
81
82 wharariki:[880]/Scratch/ak19/gs3-extensions/maori-lang-detection>mongo --shell -nodb
83 MongoDB shell version: 2.6.10 <<<<<<<<<-------------------<<<< MONGO CLIENT VERSION
84 type "help" for help
85 > help
86 db.help() help on db methods
87 db.mycoll.help() help on collection methods
88 sh.help() sharding helpers
89 rs.help() replica set helpers
90 help admin administrative help
91 help connect connecting to a db help
92 help keys key shortcuts
93 help misc misc things to know
94 help mr mapreduce
95
96 show dbs show database names
97 show collections show collections in current database
98 show users show users in current database
99 show profile show most recent system.profile entries with time >= 1ms
100 show logs show the accessible logger names
101 show log [name] prints out the last segment of log in memory, 'global' is default
102 use <db_name> set current database
103 db.foo.find() list objects in collection foo
104 db.foo.find( { a : 1 } ) list objects in foo where a == 1
105 it result of the last line evaluated; use to further iterate
106 DBQuery.shellBatchSize = x set default number of items to display on shell
107 exit quit the mongo shell
108
109 > help connect
110
111 Normally one specifies the server on the mongo shell command line. Run mongo --help to see those options.
112 Additional connections may be opened:
113
114 var x = new Mongo('host[:port]');
115 var mydb = x.getDB('mydb');
116 or
117 var mydb = connect('host[:port]/mydb');
118
119 Note: the REPL prompt only auto-reports getLastError() for the shell command line connection.
120
121 Getting help on connect options:
122
123 > var x = new Mongo('mongodb.cms.waikato.ac.nz:27017');
124 > var mydb = x.getDB('anupama');
125
126 > mydb.connect.help()
127 DBCollection help
128 db.connect.find().help() - show DBCursor help
129 db.connect.count()
130 db.connect.copyTo(newColl) - duplicates collection by copying all documents to newColl; no indexes are copied.
131 db.connect.convertToCapped(maxBytes) - calls {convertToCapped:'connect', size:maxBytes}} command
132 db.connect.dataSize()
133 db.connect.distinct( key ) - e.g. db.connect.distinct( 'x' )
134 db.connect.drop() drop the collection
135 db.connect.dropIndex(index) - e.g. db.connect.dropIndex( "indexName" ) or db.connect.dropIndex( { "indexKey" : 1 } )
136 db.connect.dropIndexes()
137 db.connect.ensureIndex(keypattern[,options]) - options is an object with these possible fields: name, unique, dropDups
138 db.connect.reIndex()
139 db.connect.find([query],[fields]) - query is an optional query filter. fields is optional set of fields to return.
140 e.g. db.connect.find( {x:77} , {name:1, x:1} )
141 db.connect.find(...).count()
142 db.connect.find(...).limit(n)
143 db.connect.find(...).skip(n)
144 db.connect.find(...).sort(...)
145 db.connect.findOne([query])
146 db.connect.findAndModify( { update : ... , remove : bool [, query: {}, sort: {}, 'new': false] } )
147 db.connect.getDB() get DB object associated with collection
148 db.connect.getPlanCache() get query plan cache associated with collection
149 db.connect.getIndexes()
150 db.connect.group( { key : ..., initial: ..., reduce : ...[, cond: ...] } )
151 db.connect.insert(obj)
152 db.connect.mapReduce( mapFunction , reduceFunction , <optional params> )
153 db.connect.aggregate( [pipeline], <optional params> ) - performs an aggregation on a collection; returns a cursor
154 db.connect.remove(query)
155 db.connect.renameCollection( newName , <dropTarget> ) renames the collection.
156 db.connect.runCommand( name , <options> ) runs a db command with the given name where the first param is the collection name
157 db.connect.save(obj)
158 db.connect.stats()
159 db.connect.storageSize() - includes free space allocated to this collection
160 db.connect.totalIndexSize() - size in bytes of all the indexes
161 db.connect.totalSize() - storage allocated for all data and indexes
162 db.connect.update(query, object[, upsert_bool, multi_bool]) - instead of two flags, you can pass an object with fields: upsert, multi
163 db.connect.validate( <full> ) - SLOW
164 db.connect.getShardVersion() - only for use with sharding
165 db.connect.getShardDistribution() - prints statistics about data distribution in the cluster
166 db.connect.getSplitKeysForChunks( <maxChunkSize> ) - calculates split points over all chunks and returns splitter function
167 db.connect.getWriteConcern() - returns the write concern used for any operations on this collection, inherited from server/db if set
168 db.connect.setWriteConcern( <write concern doc> ) - sets the write concern for writes to the collection
169 db.connect.unsetWriteConcern( <write concern doc> ) - unsets the write concern for writes to the collection
170 > mydb.version()
171 4.0.13 <<<<<<<<<-------------------<<<< MONGODB SERVER VERSION
172
(Check Mongo server version: https://stackoverflow.com/questions/38160412/how-to-find-the-exact-version-of-installed-mongodb)

Finally, we now know the MongoDB server version: 4.0.13
This server version doesn't work with our mongo client (shell) version of 2.6.10.
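
A minimal sketch of the version comparison made above, in plain JavaScript. The rule used here (flag a client whose major version is older than the server's) is only an assumed rule of thumb for illustration, not MongoDB's official compatibility matrix, and the function names are hypothetical.

```javascript
// Parse a "major.minor.patch" string into an array of numbers.
function parseVersion(version) {
  return version.split('.').map(Number);
}

// ASSUMPTION: a client is treated as "too old" when its major version
// is lower than the server's major version.
function clientLooksTooOld(clientVersion, serverVersion) {
  const clientMajor = parseVersion(clientVersion)[0];
  const serverMajor = parseVersion(serverVersion)[0];
  return clientMajor < serverMajor;
}

// Versions taken from the transcript above.
console.log(clientLooksTooOld('2.6.10', '4.0.13'));
console.log(clientLooksTooOld('4.0.13', '4.0.13'));
```

With the versions from the notes, the first call flags the 2.6.10 shell against the 4.0.13 server, and the second does not flag a 4.0.13 shell.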
177
178
179DETAILS OF INSTALLING MONGO-CLIENT AND UPDATING IT, AND INSTALLING MONGODB SERVER:
180
181
182 54 sudo apt-get purge mongodb-clients
183 55 sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 9DA31620334BD75D9DCB49F368818C72E52529D4
184 56 echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu xenial/mongodb-org/4.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.0.list
185 57 sudo apt-get update
186 58 sudo apt-get install mongodb-clients
187 59 mongo 'mongodb://mongodb.cms.waikato.ac.nz:27017' -u anupama -p
188 60 sudo apt-get install -y mongodb-org
189 61 mongo 'mongodb://mongodb.cms.waikato.ac.nz:27017' -u anupama -p
190 62 sudo service apache2 status
191 63 sudo service sshd status
192 64 sudo service mongodb status
193 65 sudo service mongo status
194 66 mongod
195 67 mongod --help
196 68 mongod --help | less
197 69 mongod -f /etc/mongod.conf
198 70 sudo mongod -f /etc/mongod.conf
199 71 less /etc/mongod.conf
200 72 sudo service mongod status
201 73 sudo service mongod start
202 74 sudo service mongod status
203 75 ls -l /var/log/mongodb/mongod.log
204 76 sudo rm /var/log/mongodb/mongod.log
205 77 sudo service mongod status
206 78 sudo service mongod start
207 79 sudo service mongod status
208 80 sudo service mongod stop
209 81 ps auxww | grep mongo
210 82 sudo service mongod start
211 83 sudo service mongod status
212 84 ps auxww | grep mongo
213 85 sudo dmsg
214 86 sudo dmesg
215 87 sudo service mongod status
216 88 sudo service mongod stop
217 89 sudo service mongod start
218 90 sudo dmesg
219 91 sudo less /var/log/mongodb/mongod.log
220 92 ls /var/lib/
221 93 ls -ld /var/lib/
222 94 ls -l /var/log/mongodb/mongod.log
223 95 ls -ld /var/lib/
224 96 groups mongodb
225 97 less /etc/mongod.conf
226 98 sudo less /var/log/mongodb/mongod.log
227 99 less /etc/mongod.conf
228 100 ls -l /var/lib/mongodb/
229 101 sudo chown -R mongodb /var/lib/mongodb/
230 102 sudo chgrp -R mongodb /var/lib/mongodb/
231 103 sudo service mongod start
232 104 sudo service mongod status
233 105 history
234
235
236
MONGO DB ROBO 3T
1. Download the "Double Pack" from https://robomongo.org/
2. Untar its contents. Then untar the tarball inside that.
3. Run:
    wharariki:[110]~/Downloads/robo3t-1.3.1-linux-x86_64-7419c406>./bin/robo3t
242
243===================
244On analytics, vagrant node1, we've installed the mongodb server and client.
245We're able to successfully create collections on here.
246
247
248vagrant@node1:~$ mongo
249MongoDB shell version v4.2.1
250connecting to: mongodb://127.0.0.1:27017/?compressors=disabled&gssapiServiceName=mongodb
251Implicit session: session { "id" : UUID("87bb585c-4685-47f6-bf89-a93801daeb2d") }
252MongoDB server version: 4.2.1
253Server has startup warnings:
2542019-11-04T07:48:14.197+0000 I STORAGE [initandlisten]
2552019-11-04T07:48:14.198+0000 I STORAGE [initandlisten] ** WARNING: Using the XFS filesystem is strongly recommended with the WiredTiger storage engine
2562019-11-04T07:48:14.198+0000 I STORAGE [initandlisten] ** See http://dochub.mongodb.org/core/prodnotes-filesystem
2572019-11-04T07:48:14.624+0000 I CONTROL [initandlisten]
2582019-11-04T07:48:14.624+0000 I CONTROL [initandlisten] ** WARNING: Access control is not enabled for the database.
2592019-11-04T07:48:14.624+0000 I CONTROL [initandlisten] ** Read and write access to data and configuration is unrestricted.
2602019-11-04T07:48:14.624+0000 I CONTROL [initandlisten]
261---
262Enable MongoDB's free cloud-based monitoring service, which will then receive and display
263metrics about your deployment (disk utilization, CPU, operation statistics, etc).
264
265The monitoring data will be available on a MongoDB website with a unique URL accessible to you
266and anyone you share the URL with. MongoDB may use this information to make product
267improvements and to suggest MongoDB products and deployment options to you.
268
269To enable free monitoring, run the following command: db.enableFreeMonitoring()
270To permanently disable this reminder, run the following command: db.disableFreeMonitoring()
271---
272
273> show dbs
274admin 0.000GB
275config 0.000GB
276local 0.000GB
277> use db ateacrawldata
2782019-11-05T05:24:20.155+0000 E QUERY [js] Error: [db ateacrawldata] is not a valid database name :
279Mongo.prototype.getDB@src/mongo/shell/mongo.js:51:12
280getDatabase@src/mongo/shell/session.js:913:28
281DB.prototype.getSiblingDB@src/mongo/shell/db.js:22:12
282shellHelper.use@src/mongo/shell/utils.js:803:10
283shellHelper@src/mongo/shell/utils.js:790:15
284@(shellhelp2):1:1
285> db.createCollection('webpages');
286{ "ok" : 1 }
287> db.webpages.drop();
288... ^C
289
290> db.webpages.drop();
291true
292> use ateacrawldata
293switched to db ateacrawldata
294> db.createCollection('webpages');
295{ "ok" : 1 }
296> show collections
297webpages
298> db.createCollection('websites');
299{ "ok" : 1 }
300>
301
302------------------------
303
Ask Clint to rename the "anupama" database to "ateacrawldata", following the instructions at:
    https://stackoverflow.com/questions/9201832/how-do-you-rename-a-mongodb-database
I don't have permissions to do this.
Nor do I have permissions to create Mongo collections within a new database that I create, like ateacrawldata.
I only seem to have rights to the "anupama" database.
309
310
311
312-----------------------
313Vagrant virtual machine Node1 has the mongodb installed.
314
315After doing "vagrant up" on node1 to start node1:
316
317 [anupama@analytics vagrant-hadoop-hive-spark]$ vagrant ssh
318 vagrant@node1:~$ mongo
319 MongoDB shell version v4.2.1
320 connecting to: mongodb://127.0.0.1:27017/?compressors=disabled&gssapiServiceName=mongodb
321 2019-11-13T09:22:46.996+0000 E QUERY [js] Error: couldn't connect to server 127.0.0.1:27017, connection attempt failed: SocketException: Error connecting to 127.0.0.1:27017 :: caused by :: Connection refused :
322 connect@src/mongo/shell/mongo.js:341:17
323 @(connect):2:6
324 2019-11-13T09:22:46.999+0000 F - [main] exception: connect failed
325 2019-11-13T09:22:46.999+0000 E - [main] exiting with code 1
326 vagrant@node1:~$ sudo service mongod status
327 ● mongod.service - MongoDB Database Server
328 Loaded: loaded (/lib/systemd/system/mongod.service; disabled; vendor preset: enabled)
329 Active: inactive (dead)
330 Docs: https://docs.mongodb.org/manual
331 vagrant@node1:~$ sudo service mongod start
332 vagrant@node1:~$ sudo service mongod status
333 ● mongod.service - MongoDB Database Server
334 Loaded: loaded (/lib/systemd/system/mongod.service; disabled; vendor preset: enabled)
335 Active: active (running) since Wed 2019-11-13 09:24:07 UTC; 2s ago
336 Docs: https://docs.mongodb.org/manual
337 Main PID: 4383 (mongod)
338 Tasks: 32
339 Memory: 199.3M
340 CPU: 754ms
341 CGroup: /system.slice/mongod.service
342 └─4383 /usr/bin/mongod --config /etc/mongod.conf
343
344 Nov 13 09:24:07 node1 systemd[1]: Started MongoDB Database Server.
345 vagrant@node1:~$
346
347
So now mongodb is running on node1 on localhost:27017.

Next, in another x-term on analytics, connect to the node1 Vagrant VM while port-forwarding node1's localhost:27017 to analytics' localhost:27017:
    vagrant ssh -- -L 27017:localhost:27017


Finally, in another x-term (on wharariki), port-forward analytics' port 27017 to the current machine's port 27017:
    ssh -L 27017:localhost:27017 analytics


Run Robo 3T: go to /home/anupama/Downloads/robo3t-1.3.1-linux-x86_64-7419c406/bin
and double-click robo3t.

In the connection screen, choose localhost:27017.
Robo 3T running on the current machine can now connect to localhost:27017.

Then, in a new x-term, the mongo client shell can be used to connect (by default to localhost:27017):
366
367 wharariki:[122]/Scratch/ak19/GS309>mongo --shell
368 MongoDB shell version v4.0.13
369 connecting to: mongodb://127.0.0.1:27017/?gssapiServiceName=mongodb
370 ...
371 > show dbs
372 admin 0.000GB
373 ateacrawldata 1.532GB
374 config 0.000GB
375 local 0.000GB
376 > use ateacrawldata
377
378 > show collections
379 Webpages
380 Websites
381 oldwebpages
382 oldwebsites
383-------------------
384
Country code to geolocation CSV file found by Dr Bainbridge:
https://developers.google.com/public-data/docs/canonical/countries_csv

Import into mongodb with:
https://stackoverflow.com/questions/4686500/how-to-use-mongoimport-to-import-csv


NOTE: mongoimport is a command-line utility, not a command to be run from within the mongo shell. See https://jira.mongodb.org/browse/DOCS-11072
This means: in an x-term, DON'T START THE MONGO SHELL/client first. Instead, run the following directly from the x-term to import the countrycodes.csv file:


    mongoimport -d ateacrawldata -c countrylocations --type csv --file /Scratch/ak19/maori-lang-detection/MoreReading/countrycodes.csv --headerline
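
To illustrate what --headerline does in the import above (the first CSV line supplies the field names for every document), here is a rough plain-JavaScript sketch. The column names and sample rows are assumptions about the shape of countrycodes.csv, the helper name is hypothetical, and the sketch ignores quoted commas and type conversion.

```javascript
// ASSUMPTION: sample data in the presumed shape of countrycodes.csv;
// the coordinate values are illustrative only.
const sampleCsv = [
  'country,latitude,longitude,name',
  'NZ,-40.9,174.9,New Zealand',
  'AU,-25.3,133.8,Australia',
].join('\n');

// Turn CSV text into one object per data row, using the first line
// as the field names (the role --headerline plays for mongoimport).
function csvToDocs(csvText) {
  const lines = csvText.trim().split('\n');
  const fields = lines[0].split(',');
  return lines.slice(1).map((row) => {
    const values = row.split(',');
    const doc = {};
    fields.forEach((field, i) => {
      doc[field] = values[i]; // values are kept as strings in this sketch
    });
    return doc;
  });
}

const countryDocs = csvToDocs(sampleCsv);
console.log(countryDocs);
```

Each resulting object corresponds to one document that would land in the countrylocations collection.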
398
399
400-------------------------
401
MONGODB QUERIES:

db.getCollection('webpages').find({"isMRI": true, "singleSentences.langCode": "mri"})
db.getCollection('webpages').find({"singleSentences": { $elemMatch: {"langCode":"mri"} } }, {"singleSentences.$": 1})
db.getCollection('Webpages').find({"isMRI": true, "singleSentences": { $elemMatch: {"langCode":"eng"} } }, {"singleSentences.$": 1})    [gets the 1st English sentence of each page that is overall in MRI]
db.getCollection('Webpages').find({"containsMRI": true, "singleSentences": { $elemMatch: {"langCode":"mri"} } }, {"singleSentences.$": 1})    [gets the 1st MRI sentence of each page that contains MRI sentences]
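
What the $elemMatch queries with the positional "singleSentences.$" projection return can be sketched in plain JavaScript over in-memory data. The sample pages, the "sentence" field name and the helper name are all hypothetical; only langCode and singleSentences come from the queries above.

```javascript
// Hypothetical sample pages; only singleSentences/langCode are taken
// from the queries in these notes.
const pages = [
  {
    _id: 1,
    singleSentences: [
      { langCode: 'eng', sentence: 'Hello.' },
      { langCode: 'mri', sentence: 'Kia ora.' },
      { langCode: 'mri', sentence: 'Haere mai.' },
    ],
  },
  {
    _id: 2,
    singleSentences: [{ langCode: 'eng', sentence: 'English only.' }],
  },
];

// Keep only pages with at least one sentence in langCode ($elemMatch),
// and return just the first matching sentence (the "$" projection).
function firstSentenceInLang(docs, langCode) {
  return docs
    .filter((doc) => doc.singleSentences.some((s) => s.langCode === langCode))
    .map((doc) => ({
      _id: doc._id,
      singleSentences: [doc.singleSentences.find((s) => s.langCode === langCode)],
    }));
}

const mriFirst = firstSentenceInLang(pages, 'mri');
console.log(JSON.stringify(mriFirst, null, 2));
```

Page 2 is dropped because it has no "mri" sentence, and page 1 comes back with only its first "mri" sentence rather than all three.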
408
409
410READING
411
412mongodb java convert class
413https://www.quora.com/What-are-the-ways-of-converting-a-Java-object-to-a-MongoDB-document-and-vice-versa
414https://stackoverflow.com/questions/39320825/pojo-to-org-bson-document-and-vice-versa
415X https://mongodb.github.io/morphia/
416https://stackoverflow.com/questions/10170506/inserting-java-object-to-mongodb-collection-using-java
417X https://www.google.com/search?q=morphia+example&oq=morphia+example&aqs=chrome.0.0l6.4223j0j9&sourceid=chrome&ie=UTF-8
418https://www.baeldung.com/mongodb-morphia
419X https://web.archive.org/web/20171117121335/http://mongodb.github.io/morphia/1.3/getting-started/
420=> https://morphia.dev/1.4/getting-started/quick-tour/
421https://github.com/MorphiaOrg/morphia/tree/master/docs/reference
422
423
424mongodb querying
425https://docs.mongodb.com/manual/tutorial/query-embedded-documents/
426https://docs.mongodb.com/manual/tutorial/query-arrays/
427https://www.google.com/search?q=mongodb+find+subdocument&oq=mongodb+find+&aqs=chrome.0.69i59j69i57j0l4.7607j1j8&sourceid=chrome&ie=UTF-8
428https://stackoverflow.com/questions/25586901/how-to-find-document-and-single-subdocument-matching-given-criterias-in-mongodb
429https://stackoverflow.com/questions/21113543/mongodb-get-subdocument
430https://stackoverflow.com/questions/36948856/find-subdocuments-in-mongo
431https://docs.mongodb.com/v3.0/reference/operator/projection/positional/#proj._S_
432https://www.google.com/search?q=mongodb+query+tutorial&oq=mongodb+query+tutorial&aqs=chrome..69i57j0l2j69i60l3.4719j0j7&sourceid=chrome&ie=UTF-8
433https://blog.exploratory.io/an-introduction-to-mongodb-query-for-beginners-bd463319aa4c
434https://docs.mongodb.com/manual/reference/method/db.collection.find/
435https://docs.mongodb.com/manual/reference/method/db.collection.find/#find-projection
436https://stackoverflow.com/questions/39641925/mongodb-aggregation-framework-to-get-frequencies-of-fields-values
437
438https://exploratory.io/note/kanaugust/0961813761939766
439https://docs.mongodb.com/manual/tutorial/project-fields-from-query-results/
440https://docs.mongodb.com/manual/aggregation/
441
442
443Mongo Studio 3T documentation:
444https://studio3t.com/download/ (also has uninstall information)
445https://studio3t.com/download-thank-you/?OS=x64
446
447Google: MongoDB visualization
448MongoDB visualization map
449MongoDB Charts
450 (Open source visualisation tools)
451
452json map visualizer
453 geojson.tools
454-------------------
455
Some queries with results:

# Num websites
db.getCollection('Websites').find({}).count()
1445

# Num webpages
db.getCollection('Webpages').find({}).count()
X 75139
117496

# Find number of websites that have 1 or more pages detected as being in Maori (a positive numPagesInMRI)
db.getCollection('Websites').find({numPagesInMRI: { $gt: 0}}).count()
361

# Number of sites containing at least one sentence for which OpenNLP detected the best language = MRI
db.getCollection('Websites').find({numPagesContainingMRI: {$gt: 0}}).count()
868

# Obviously, the union of the above two is identical to the set of sites with numPagesContainingMRI > 0:
db.getCollection('Websites').find({ $or: [ { numPagesInMRI: { $gt: 0 } }, { numPagesContainingMRI: {$gt: 0} } ] } ).count()
868
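
The reason the $or count equals the numPagesContainingMRI count can be sketched in plain JavaScript: if every site with pages in MRI also has pages containing MRI, the first set is a subset of the second, so the union adds nothing. The sample sites below are hypothetical, and the subset relationship is an assumption taken from the "Obviously" remark above.

```javascript
// Hypothetical sample sites. ASSUMPTION: a site with numPagesInMRI > 0
// always has numPagesContainingMRI > 0 as well.
const sampleSites = [
  { domain: 'a.example', numPagesInMRI: 2, numPagesContainingMRI: 5 },
  { domain: 'b.example', numPagesInMRI: 0, numPagesContainingMRI: 1 },
  { domain: 'c.example', numPagesInMRI: 0, numPagesContainingMRI: 0 },
];

const inMRI = sampleSites.filter((s) => s.numPagesInMRI > 0);
const containingMRI = sampleSites.filter((s) => s.numPagesContainingMRI > 0);
// Equivalent of the $or query.
const either = sampleSites.filter(
  (s) => s.numPagesInMRI > 0 || s.numPagesContainingMRI > 0
);

console.log(inMRI.length, containingMRI.length, either.length);
```

On this sample the union has the same size as the containingMRI set, mirroring the 868 / 868 result above.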

# Find number of webpages that are deemed to be overall in MRI (pages where isMRI=true)
db.getCollection('Webpages').find({isMRI:true}).count()
X 5224
X 5215
7818

# Number of pages that contain any number of MRI sentences
db.getCollection('Webpages').find({containsMRI: true}).count()
X 12858
20371


# Number of sites with URLs containing /mi(/)
db.getCollection('Websites').find({urlContainsLangCodeInPath:true}).count()
X 153
# Number of sites with URLs containing /mi(/) OR http(s)://mi.*
db.getCollection('Websites').find({urlContainsLangCodeInPath:true}).count()
670

# Number of websites that are outside NZ that contain /mi(/) in any of their sub-urls
db.getCollection('Websites').find({urlContainsLangCodeInPath:true, geoLocationCountryCode: {$ne : "NZ"} }).count()
X 147
# Number of websites that are outside NZ that contain /mi(/) OR http(s)://mi.* in any of their sub-urls
db.getCollection('Websites').find({urlContainsLangCodeInPath:true, geoLocationCountryCode: {$ne : "NZ"} }).count()
656

# 6 sites with URLs containing /mi(/) that are in NZ
db.getCollection('Websites').find({urlContainsLangCodeInPath:true, geoLocationCountryCode: "NZ"}).count()
X 6
# 14 sites with URLs containing /mi(/) OR http(s)://mi.* that are in NZ (same query)
14
511
512
# sort websites that contain /mi(/) in path by geoLocationCountryCode
# https://www.quackit.com/mongodb/tutorial/mongodb_sort_query_results.cfm
db.getCollection('Websites').find({urlContainsLangCodeInPath:true}).sort({geoLocationCountryCode: 1})

Actually, I want to sort by count. See https://docs.mongodb.com/manual/reference/operator/aggregation/sortByCount/


# PROJECTION:
db.getCollection('Websites').find({geoLocationCountryCode: {$ne:"NZ"}}, {geoLocationCountryCode:1, urlContainsLangCodeInPath: 1})
522
https://docs.mongodb.com/manual/aggregation/
EXAMPLE:
db.orders.aggregate([
    { $match: { status: "A" } },
    { $group: { _id: "$cust_id", total: { $sum: "$amount" } } }
])

X (fails: $group is not wrapped in its own {...} stage and has no _id)
db.Websites.aggregate([{ $match:{urlContainsLangCodeInPath:true}}, $group: {geoLocationCountryCode:1, total: $count}])


X (fails: $group requires an _id field)
db.Websites.aggregate([
    { $match:{urlContainsLangCodeInPath:true}},
    {$group: {geoLocationCountryCode:1}}
])

WORKS (but adding an "$unwind" stage on geoLocationCountryCode will get rid of the "null" group):
db.Websites.aggregate([
    { $match:{urlContainsLangCodeInPath:true}},
    {$group: {_id: "$geoLocationCountryCode", count: {$sum: 1}}},
    { $sort : { count : -1} }
])
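
As a rough illustration of what the working $match / $group / $sort pipeline computes, here is a plain-JavaScript sketch over in-memory data. The sample sites and the helper name are hypothetical; a site with no country code ends up in a null group, as noted above.

```javascript
// Hypothetical sample sites; field names follow the Websites collection.
const websites = [
  { domain: 'a.example', urlContainsLangCodeInPath: true, geoLocationCountryCode: 'US' },
  { domain: 'b.example', urlContainsLangCodeInPath: true, geoLocationCountryCode: 'US' },
  { domain: 'c.example', urlContainsLangCodeInPath: true, geoLocationCountryCode: 'NZ' },
  { domain: 'd.example', urlContainsLangCodeInPath: true },
  { domain: 'e.example', urlContainsLangCodeInPath: false, geoLocationCountryCode: 'NZ' },
];

function countByCountry(docs) {
  const counts = new Map();
  for (const doc of docs) {
    if (doc.urlContainsLangCodeInPath !== true) continue; // $match
    // $group _id: a missing country code is grouped under null
    const key = doc.geoLocationCountryCode === undefined ? null : doc.geoLocationCountryCode;
    counts.set(key, (counts.get(key) || 0) + 1); // count: {$sum: 1}
  }
  return Array.from(counts, ([_id, count]) => ({ _id, count }))
    .sort((a, b) => b.count - a.count); // $sort: {count: -1}
}

const countryCounts = countByCountry(websites);
console.log(countryCounts);
```

The group with the highest count comes first; the relative order of groups with equal counts is not something the pipeline itself guarantees.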
544
545
# COUNT OF ALL GEOLOCATION COUNTRIES
# https://stackoverflow.com/questions/14924495/mongodb-count-num-of-distinct-values-per-field-key
    # LIST
    db.Websites.distinct('geoLocationCountryCode');

    # COUNT
    db.Websites.distinct('geoLocationCountryCode').length;

    # A COUNT WITH QUERY - https://docs.mongodb.com/manual/reference/command/distinct/#dbcmd.distinct

    db.runCommand ( { distinct: "Websites", key: "geoLocationCountryCode", query: { "urlContainsLangCodeInPath": true} } );

    # DISTINCT WITH QUERY WITHOUT COUNT - https://docs.mongodb.com/manual/reference/method/db.collection.distinct/
    db.Websites.distinct('geoLocationCountryCode', {"urlContainsLangCodeInPath": true});

    # SORTED - https://stackoverflow.com/questions/4759437/get-distinct-values-with-sorted-data
    db.Websites.distinct('geoLocationCountryCode', {"urlContainsLangCodeInPath": true}).sort();


    # count of all sites for which the geolocation is UNKNOWN
    db.getCollection('Websites').find({geoLocationCountryCode: {$eq:"UNKNOWN"}}).count()
567
568
# AGGREGATION QUERIES THAT WORK:
# https://stackoverflow.com/questions/14924495/mongodb-count-num-of-distinct-values-per-field-key

WORKS:
// count of country codes for all sites
db.Websites.aggregate([
    { $unwind: "$geoLocationCountryCode" },
    {
        $group: {
            _id: "$geoLocationCountryCode",
            count: { $sum: 1 }
        }
    },
    { $sort : { count : -1} }
]);

// count of country codes for sites that have at least one page detected as MRI
db.Websites.aggregate([
    {
        $match: {
            numPagesInMRI: {$gt: 0}
        }
    },
    { $unwind: "$geoLocationCountryCode" },
    {
        $group: {
            _id: {$toLower: '$geoLocationCountryCode'},
            count: { $sum: 1 }
        }
    },
    { $sort : { count : -1} }
]);

// count of country codes for sites that have at least one page containing at least one sentence detected as MRI
db.Websites.aggregate([
    {
        $match: {
            numPagesContainingMRI: {$gt: 0}
        }
    },
    { $unwind: "$geoLocationCountryCode" },
    {
        $group: {
            _id: {$toLower: '$geoLocationCountryCode'},
            count: { $sum: 1 }
        }
    },
    { $sort : { count : -1} }
]);


WORKS:
// count of country codes for sites that have /mi(/) or http(s)://mi.* in URL path
db.Websites.aggregate([
    {
        $match: {
            urlContainsLangCodeInPath: true
        }
    },
    { $unwind: "$geoLocationCountryCode" },
    {
        $group: {
            _id: {$toLower: '$geoLocationCountryCode'},
            count: { $sum: 1 }
        }
    },
    { $sort : { count : -1} }
]);


WORKS:
// count of country codes for all sites whose geolocation is not UNKNOWN
db.Websites.aggregate([
    {
        $match: {
            geoLocationCountryCode: {$ne : "UNKNOWN"}
        }
    },
    { $unwind: "$geoLocationCountryCode" },
    {
        $group: {
            _id: "$geoLocationCountryCode",
            count: { $sum: 1 }
        }
    },
    { $sort : { count : -1} }
]);

WORKS:
// same as the /mi(/) query above, but without lowercasing the country codes
db.Websites.aggregate([
    {
        $match: {
            "urlContainsLangCodeInPath": true
        }
    },
    { $unwind: "$geoLocationCountryCode" },
    {
        $group: {
            _id: "$geoLocationCountryCode",
            count: { $sum: 1 }
        }
    },
    { $sort : { count : -1} }
]);

676
KEEP ADDITIONAL FIELDS - https://stackoverflow.com/questions/16662405/mongo-group-query-how-to-keep-fields:
    a. KEEPS ONLY FIRST DOMAIN URL FOR EACH COUNTED COUNTRY CODE:

    db.Websites.aggregate([
        {
            $match: {
                "urlContainsLangCodeInPath": true
            }
        },
        { $unwind: "$geoLocationCountryCode" },
        {
            $group: {
                _id: "$geoLocationCountryCode",
                count: { $sum: 1 },
                domain: {$first: '$domain'}
            }
        },
        { $sort : { count : -1} }
    ]);

    b. KEEP ALL DOMAIN URLS:
    db.Websites.aggregate([
        {
            $match: {
                "urlContainsLangCodeInPath": true
            }
        },
        { $unwind: "$geoLocationCountryCode" },
        {
            $group: {
                _id: "$geoLocationCountryCode",
                count: { $sum: 1 },
                domain: { $addToSet: '$domain' }
            }
        },
        { $sort : { count : -1} }
    ]);
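
The difference between variants a ($first) and b ($addToSet) can be sketched in plain JavaScript over in-memory data. The sample sites, the helper name and the output field names firstDomain/allDomains are hypothetical; in MongoDB the order of elements produced by $addToSet is unspecified, whereas this sketch happens to keep insertion order.

```javascript
// Hypothetical sample sites; field names follow the Websites collection.
const miSites = [
  { domain: 'https://one.example', urlContainsLangCodeInPath: true, geoLocationCountryCode: 'US' },
  { domain: 'https://two.example', urlContainsLangCodeInPath: true, geoLocationCountryCode: 'US' },
  { domain: 'https://three.example', urlContainsLangCodeInPath: true, geoLocationCountryCode: 'NZ' },
  { domain: 'https://four.example', urlContainsLangCodeInPath: false, geoLocationCountryCode: 'NZ' },
];

function groupDomainsByCountry(docs) {
  const groups = new Map();
  for (const doc of docs) {
    if (doc.urlContainsLangCodeInPath !== true) continue; // $match
    const key = doc.geoLocationCountryCode;
    if (!groups.has(key)) {
      // the first domain seen for a group plays the role of $first
      groups.set(key, { _id: key, count: 0, firstDomain: doc.domain, all: new Set() });
    }
    const group = groups.get(key);
    group.count += 1;          // count: {$sum: 1}
    group.all.add(doc.domain); // domain: {$addToSet: '$domain'}
  }
  return Array.from(groups.values())
    .map((g) => ({ _id: g._id, count: g.count, firstDomain: g.firstDomain, allDomains: Array.from(g.all) }))
    .sort((a, b) => b.count - a.count); // $sort: {count: -1}
}

const domainGroups = groupDomainsByCountry(miSites);
console.log(JSON.stringify(domainGroups, null, 2));
```

Variant a corresponds to firstDomain (one representative URL per country), variant b to allDomains (every distinct URL per country).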
713
714
# WANT TO GET THE ABOVE INTO A WORLD MAP, using geojson.tools found by Dr Bainbridge
geojson.tools
USAGE: https://www.here.xyz/viewer-tool/


AIMS:
* Identify where the Maori language is online.
* Work out how we can identify high-quality sites that would be good for a corpus.
  (Look at related work for other languages that answers this quantifiably.)

data-preparation
docs
728
729------------------------------------------
730
BUILDING TOWARDS NEW MONGODB QUERY: Counts by country code of TENTATIVE NON-PRODUCT SITES that are in Maori
---

# https://stackoverflow.com/questions/16902930/mongodb-aggregation-framework-match-or
# https://docs.mongodb.com/manual/reference/operator/query/and/

# 1. all the websites which are from NZ:
db.getCollection('Websites').find({geoLocationCountryCode: "NZ"}).count()
128

# 2. all the websites that have /mi in URL path which are from NZ:
db.getCollection('Websites').find({$and: [{urlContainsLangCodeInPath: true}, {geoLocationCountryCode: "NZ"}]}).count()
6

# 3. all the websites that don't have /mi in URL path
db.getCollection('Websites').find({urlContainsLangCodeInPath: false}).count()
1292

# 4. all the websites that don't have /mi, or if they do are from NZ
# (should be the sum of the counts of points 2 and 3 above)
db.getCollection('Websites').find({$or: [{urlContainsLangCodeInPath: false}, {$and: [{urlContainsLangCodeInPath: true}, {geoLocationCountryCode: "NZ"}]}]}).count()
1298

# 5. All the websites that have at least 1 page containing MRI (numPagesContainingMRI > 0) AND either don't have /mi in URL path or if they do are from NZ
# These are the TENTATIVE NON-PRODUCT SITES
# Should be no more than the count of point 4

db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{$or: [{urlContainsLangCodeInPath: false}, {$and: [{urlContainsLangCodeInPath: true}, {geoLocationCountryCode: "NZ"}]}]}]}).count()
X 859

Now with http(s)://mi.* also excluded, the above query returns a count of:
389


BUT THIS IS THE CORRECT VERSION OF THE QUERY:
db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{$or: [{geoLocationCountryCode: "NZ"}, {urlContainsLangCodeInPath: false}]}]}).count()
389
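
Both forms of the $or condition in point 5 return 389. A small truth-table check in plain JavaScript suggests why: assuming urlContainsLangCodeInPath is always a plain boolean (an assumption; a missing or null field would behave differently), "no /mi, or /mi and NZ" selects the same sites as "NZ, or no /mi". The function names are hypothetical.

```javascript
// hasMi stands for urlContainsLangCodeInPath, isNZ for
// geoLocationCountryCode === "NZ". ASSUMPTION: both are plain booleans.
function longForm(hasMi, isNZ) {
  return hasMi === false || (hasMi === true && isNZ);
}

function shortForm(hasMi, isNZ) {
  return isNZ || hasMi === false;
}

// Exhaustively compare the two conditions over all four combinations.
const mismatches = [];
for (const hasMi of [true, false]) {
  for (const isNZ of [true, false]) {
    if (longForm(hasMi, isNZ) !== shortForm(hasMi, isNZ)) {
      mismatches.push({ hasMi, isNZ });
    }
  }
}

console.log(mismatches.length === 0 ? 'conditions agree' : mismatches);
```

The only combination either form rejects is a site with /mi in its URL that is not from NZ.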


# 6. Now do the counts by country code of the above, by pasting the query of point 5 as the $match clause (i.e. without the .count() suffix)
# Counts by country code of TENTATIVE NON-PRODUCT SITES that are in Maori
db.Websites.aggregate([
    {
        $match: {$and: [{numPagesContainingMRI: {$gt: 0}},{$or: [{urlContainsLangCodeInPath: false}, {$and: [{urlContainsLangCodeInPath: true}, {geoLocationCountryCode: "NZ"}]}]}]}
    },
    { $unwind: "$geoLocationCountryCode" },
    {
        $group: {
            _id: {$toLower: '$geoLocationCountryCode'},
            count: { $sum: 1 }
        }
    },
    { $sort : { count : -1} }
]);
785
The result is very close to the same aggregate on just numPagesContainingMRI > 0.

That's because very few websites both contain /mi/ in the URL path AND have numPagesContainingMRI > 0:
789
790db.Websites.aggregate([
791 {
792 $match: {
793 $and: [{numPagesContainingMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}]
794 }
795 },
796 { $unwind: "$geoLocationCountryCode" },
797 {
798 $group: {
799 _id: {$toLower: '$geoLocationCountryCode'},
800 count: { $sum: 1 }
801 }
802 },
803 { $sort : { count : -1} }
804]);
805
806
_id   count
us    4.0
nz    4.0
au    3.0
ru    1.0
de    1.0

Total: 13 sites that have /mi/ and are detected as having MRI content:
db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}]}).count()
13
817
Of these 13, the 4 from NZ were already included in steps 5 and 6. So the difference is only 9 sites (13 - 4) that have /mi in the URL path.
819
820
821Let's get a listing of the sites' domains - 3 whose country codes are NOT NZ have NZ TLD!
822/* 1 */
823{
824 "_id" : "nz",
825 "count" : 4.0,
826 "domain" : [
827 "http://firstworldwar.tki.org.nz",
828 "http://www.firstworldwar.tki.org.nz",
829 "https://admin.teara.govt.nz",
830 "http://community.nzdl.org"
831 ]
832}
833
834/* 2 */
835{
836 "_id" : "us",
837 "count" : 4.0,
838 "domain" : [
839 "https://sexualviolence.victimsinfo.govt.nz",
840 "https://follow3rs.com",
841 "http://www.church-of-christ.org",
842 "http://www.mytrickstips.com"
843 ]
844}
845
846/* 3 */
847{
848 "_id" : "au",
849 "count" : 3.0,
850 "domain" : [
851 "https://rapuatearatika.education.govt.nz",
852 "https://www.kiwiproperty.com",
853 "https://curriculumtool.education.govt.nz"
854 ]
855}
856
857/* 4 */
858{
859 "_id" : "ru",
860 "count" : 1.0,
861 "domain" : [
862 "http://www.treningmozga.com"
863 ]
864}
865
866/* 5 */
867{
868 "_id" : "de",
869 "count" : 1.0,
870 "domain" : [
871 "http://www.almancax.com" # Website to learn German, autotranslated
872 ]
873}
874
875
876But we're not catching a potentially large number of auto-translated sites, like
877- https://www.gigalight.com/all-languages.html
878- http://www.hzhinew.com/
879
880https://culturesconnection.com/manual-or-automatic-translation/
881Manual Or Automatic Translation?
882
883Automatic translation continues to improve day by day. However, it is still unable to reach perfect levels of accuracy and lacks a natural feel. Will it ever replace human translation?
884
885--------------
Mr Bill Rogers' suggestions for a first attempt at sieving out the auto-translated sites:
- skip .com and .co.<tld> sites. But .co.nz is also used for non-commercial sites or sites that nevertheless have high quality Maori language content.
- change the cut-off value of the OpenNLP language prediction? But for sentences and overlapping sentences we're not using the cut-off value: we just check the best predicted language regardless of confidence level.
889
890- Code for (a range of) loading of language options in auto-translated sites?
891
892====================
893
894# https://stackoverflow.com/questions/20175122/how-can-i-use-not-like-operator-in-mongodb
895
896Info on the sites with Maori language content that are either from NZ or have .nz domain (TLD):
897
898 db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}}, {$or:[{geoLocationCountryCode: "NZ"}, {domain: /.nz$/}]}]})
899
900 db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}}, {$or:[{geoLocationCountryCode: "NZ"}, {domain: /.nz$/}]}]}).count()
901 183
902
Inverse: the sites detected as containing at least 1 Maori language sentence that are neither from NZ nor have a .nz domain ending (TLD):
904 db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}}, {geoLocationCountryCode: {$ne: "NZ"}}, {domain: {$not: /.nz$/}}]}).count()
905 685
906
907The above two figures correctly add up to a total of 868 sites, which is the number of sites detected as containing at least 1 sentence in MRI:
908 db.getCollection('Websites').find({numPagesContainingMRI: {$gt: 0}}).count()
909 868
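
One caveat about the regular expression in the queries above: in /.nz$/ the dot is unescaped, so it matches any character, whereas /\.nz$/ (used in some later queries) requires a literal dot. The sketch below is plain JavaScript with made-up domain strings, not data from the Websites collection. It only illustrates how the patterns differ, including the unanchored /\.nz/ form. Whether any stored domain is actually affected, and so whether the recorded counts would change, has not been checked here.

```javascript
// The three regex variants that appear in these notes.
const loose = /.nz$/;       // unescaped dot: any character followed by "nz" at the end
const strict = /\.nz$/;     // literal ".nz" at the end
const unanchored = /\.nz/;  // literal ".nz" anywhere in the string

// Made-up domains chosen to show where the variants disagree.
const samples = [
  "http://www.example.co.nz",
  "http://www.benz",
  "http://www.nzexample.com"
];

const table = samples.map(domain => ({
  domain,
  loose: loose.test(domain),
  strict: strict.test(domain),
  unanchored: unanchored.test(domain)
}));

console.log(table);
```

The second sample matches only the loose pattern, and the third only the unanchored one, so /\.nz$/ is the safest of the three for "has a .nz TLD".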
910
911Without those with /mi in path:
912 db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}}, {geoLocationCountryCode: {$ne: "NZ"}}, {domain: {$not: /.nz$/}}, {urlContainsLangCodeInPath: false}]}).count()
913
914Now let's get a listing of all 685 sites to be manually inspected to determine whether they're auto-translated:
915
916/*
917db.Websites.aggregate([
918 {
919 $match: {
920 $and: [{numPagesContainingMRI: {$gt: 0}}, {geoLocationCountryCode: {$ne: "NZ"}}, {domain: {$not: /.nz$/}}, {urlContainsLangCodeInPath: false}]
921 }
922 },
923 { $unwind: "$geoLocationCountryCode" },
924 {
925 $group: {
926 _id: {$toLower: '$geoLocationCountryCode'},
927 count: { $sum: 1 },
928 domain: { $addToSet: '$domain' }
929 }
930 },
931 { $sort : { count : -1} }
932]);
933*/
934db.Websites.aggregate([
935 {
936 $match: {
937 $and: [{numPagesContainingMRI: {$gt: 0}}, {geoLocationCountryCode: {$ne: "NZ"}}, {domain: {$not: /.nz$/}}, {urlContainsLangCodeInPath: {$ne: true}}]
938 }
939 },
940 { $unwind: "$geoLocationCountryCode" },
941 {
942 $group: {
943 _id: {$toLower: '$geoLocationCountryCode'},
944 count: { $sum: 1 },
945 domain: { $addToSet: '$domain' }
946 }
947 },
948 { $sort : { count : -1} }
949]);
950
951
We can knock off another 54 non-NZ sites with our new urlContainsLangCodeInPathPrefix field:
953
954 db.getCollection('Websites').find({urlContainsLangCodeInPathPrefix: true, geoLocationCountryCode: {$ne: "NZ"}, domain: {$not: /.nz$/}}).count()
955 54
956
957
958SO, can repeat query with new field "urlContainsLangCodeInPathPrefix":
959Number of sites containing >= 1 MRI sentences that are not from NZ or of .nz TLD and which don't contain "/mi(/)" or "http(s)://mi." in URL path:
960 db.getCollection('Websites').find({$and: [
961 {numPagesContainingMRI: {$gt: 0}},
962 {geoLocationCountryCode: {$ne: "NZ"}},
963 {domain: {$not: /.nz$/}},
964 {urlContainsLangCodeInPathSuffix: {$ne: true}},
965 {urlContainsLangCodeInPathPrefix: {$ne: true}}
966 ]}).count()
967
968 651
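
For reference, a rough idea of what the two URL checks look for. This is only a sketch in plain JavaScript: hasLangCodePrefix and hasLangCodeSuffix are hypothetical names, and the regular expressions are an assumption based on the "/mi(/)" and "http(s)://mi." description above, not the project's actual code that populates the urlContainsLangCodeInPathPrefix and urlContainsLangCodeInPathSuffix fields.

```javascript
// Hypothetical check for a language-code subdomain, e.g. http://mi.example.com
function hasLangCodePrefix(url, code = "mi") {
  return new RegExp("^https?://" + code + "\\.", "i").test(url);
}

// Hypothetical check for a language-code path segment, e.g. /mi or /mi/
function hasLangCodeSuffix(url, code = "mi") {
  return new RegExp("/" + code + "(/|$)", "i").test(url);
}

// URLs taken from the inspection notes in this file.
const checks = {
  prefixSite: hasLangCodePrefix("http://mi.fitnessrebates.com"),
  suffixSite: hasLangCodeSuffix("https://afrikhepri.org/mi/"),
  midPath: hasLangCodeSuffix("https://www.kiwiproperty.com/the-base/mi/he-paepaki/"),
  plainPrefix: hasLangCodePrefix("http://mikestephens.co.uk/"),
  plainSuffix: hasLangCodeSuffix("http://mikestephens.co.uk/")
};

console.log(checks);
```

Note that a domain merely starting with "mi", like mikestephens.co.uk, is not flagged by either check in this sketch.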
969
970
971REDO THE COUNT BY COUNTRY QUERY FOR THIS:
972
973db.Websites.aggregate([
974 {
975 $match: {
976 $and: [{numPagesContainingMRI: {$gt: 0}}, {geoLocationCountryCode: {$ne: "NZ"}}, {domain: {$not: /.nz$/}}, {urlContainsLangCodeInPathSuffix: {$ne: true}}, {urlContainsLangCodeInPathPrefix: {$ne: true}}]
977 }
978 },
979 { $unwind: "$geoLocationCountryCode" },
980 {
981 $group: {
982 _id: {$toLower: '$geoLocationCountryCode'},
983 count: { $sum: 1 },
984 domain: { $addToSet: '$domain' }
985 }
986 },
987 { $sort : { count : -1} }
988]);
989
990
991AFTER BUGFIX FOR miInURLPath being set at the correct stage now:
992db.getCollection('Websites').find(
993{$and: [
994 {numPagesContainingMRI: {$gt: 0}},
995 {geoLocationCountryCode: {$ne: "NZ"}},
996 {domain: {$not: /.nz$/}},
997 {urlContainsLangCodeInPath: {$ne: true}}
998]}).count()
999
220
1001
1002db.Websites.aggregate([
1003 {
1004 $match: {
1005 $and: [
1006 {numPagesContainingMRI: {$gt: 0}},
1007 {geoLocationCountryCode: {$ne: "NZ"}},
1008 {domain: {$not: /.nz$/}},
1009 {urlContainsLangCodeInPath: {$ne: true}}
1010 ]
1011 }
1012 },
1013 { $unwind: "$geoLocationCountryCode" },
1014 {
1015 $group: {
1016 _id: {$toLower: '$geoLocationCountryCode'},
1017 count: { $sum: 1 },
1018 domain: { $addToSet: '$domain' }
1019 }
1020 },
1021 { $sort : { count : -1} }
1022]);
1023
A website's pages can be inspected to judge whether the site is relevant or auto-translated, as follows:
1025 db.getCollection('Webpages').find({URL:/svenkirsten.com/, mriSentenceCount: {$gt: 0}})
1026
1027
1028* CN: Only 1/113 sites from CN stood out as being of interest: http://kiwi2china.com/
1029 BUT: it's auto-translated (e.g. Dutch is clearly auto-translated), MRI not in default or any visible drop down list, and the domain changes once you view it in Dutch to https://nl.admission.nz/
1030
1031* FR: 16 sites from FR
1032 http://blueheavenisland.com, http://www.blueheavenisland.com - misdetection. French Polynesia
1033 https://www.lexilogos.com/ -> takes me to NZ website MaoriDictionary.co.nz etc for translating words anyway
1034 http://kihikihi.fr/ -> travel (blog?). Appears to be Hawaiian related and not Maori.
1035!! http://chantsdeluttes.free.fr/versionsinter/page%20maori.html -> Seems it may be a proper translation or composition, as Dutch and Flemish (and Groningense) versions are different songs by individual translators/composers
1036 http://splaf.free.fr/pfurb.html - Tahiti, French Polynesian, ... island names
1037X http://mi.fitnessrebates.com - Uses https://wordpress.org/plugins/weglot/ wordpress-compatible multilingual plugin, which ensures translated pages get indexed by google - exactly what we want to avoid
1038 http://mahajana.net - misdetected a Japanese Zen Buddhist chant as MRI
1039 http://rapanui.fr - Rapa Nui Easter Island. Misdetected.
1040 http://www.gif.ovh - autotranslated pages. Supposedly a GIF repository
1041 http://baladeornithologique.com - misdetection of the word "Retour"
1042 http://www.gaudry.be - misdetection of Japanese hiragana etc, and French "faire", as MRI
1043 http://www.gototahiti.net - probably misdetection, see title
1044 http://www.maraamusurfskirace.com - Bora Bora, French Polynesia. Misdetected.
1045 http://www.rongo-rongo.com - appears to be related to Easter Island. Just 1 sentence however.
1046 http://pt.city-usa.net - misdetection. Hawaii.
1047 https://www.manualscat.com - Misdetection. Appears to be in German. Manuals pages.
1048NL:
1049(!!!) - http://www.gouvernante.info and http://gouvernante.info - radio links to NZ websites not found by commoncrawl and which potentially have Maori language content. For example, http://irirangi.net/, https://www.atiawatoafm.com, www.maori.org.nz [http://www.gouvernante.info/radio4.htm]
1050- https://www.arrowhead.eu, https://arrowheadproject.azurewebsites.net, arrowhead.eu - misidentification of URL
- tonhut.nl - misidentification
? http://nielsonboutique.co.uk, http://longhornlaw.net, http://tetsubo.org, http://hidsonphoto.com, http://wearehomework.com/ - Feels autotranslated, but no language options visible. All SEO related
1053- diverosa.com - Rapa Nui, Easter Island
1054- nonlinear.demon.nl - misidentified
1055- encyclo.co.uk - misidentification
1056- henrifloor.nl - misidentification
1057- http://skimap.info/ - maps, NZ placenames in PDF
1058DK:
1059!! ++ http://akona.ngapuhitelevision.com, http://waiatarangatiratanga.ngapuhitelevision.com,
1060http://jazz.ngapuhitelevision.com, http://ngapuhitelevision.com, http://ngapuhiradio.com,
1061http://powhiri.ngapuhitelevision.com, http://komisch.ngapuhitelevision.com
1062- http://www.rennertweb.de - a photogallery page mentioning NZ placenames
1063CA:
1064- http://bcmarina.com AND http://bckayak.com - photos with Canadian placenames
- http://www.myrasplace.net - pages of photos, captions involving NZ placenames
1066~ http://00.gs/Maniapoto;Uriwera;Moriori;Hivaoa;Kumulipo.htm - Maori-Polynesian comparative dictionary words listing
1067- aguadilla.airport-authority.com - misidentification
1068- https://articles.imperialtometric.com - misidentification
1069- http://daandehn.com - no more than 1 sentence over multiple files. Appears to be photo captions of NZ placenames
1070DE:
1071- http://etymologie.info/~e/n_/nz-___reg.html - placenames, not meaningful
!! https://www.cartogiraffe.com/ and https://www.cartogiraffe.com - some genuine pages (Rarotongan), but one page in Czech had a single word misidentified as MRI
1073~ http://svenkirsten.com/ - one page mentions "tiki" but the rest is in English. The other is an (English) caption of "Book of Tiki A Maori Maiden"
1074- herocity - autotranslated
1075- weltderberge.de - 3 pages mention NZ mountains by name.
1076~ (arts.mythologica.fr) https://mythologica.fr/oceanie/texte/pantheon_polynesien.pdf - mentions certain Maori Gods and other Polynesian Gods by name.
1077- https://traynews.com - nothing in MRI, misdetected
1078~ http://klaaskoehne.de/galleries/nzl-taranaki/index.html - mentions NZ mountain names
1079- http://www.nierstrasz.org/deGrauwRegister.rtf - misdetected European (Dutch) names as MRI
1080X https://afrikhepri.org/mi/ - autotranslated
1081- https://www.tvteile.de - pure German pages, misdetected "Automatik" as a Maori language word
1082- etoile-de-lune.net - 5 pages containing 1 sentence each but none with 2 sentences detected
1083- https://www.you-fly.com - misdetection of German "Warum?" as MRI
1084- http://vulkane.ch - misdetected pages on Hawaiian volcanoes.
1085- http://www.stephe.de - photos from NZ captioned with NZ placenames
1086- http://insecta.pro - misdetection
1087- http://m.distanta.1km.net - NZ placenames. Lots of distances mentioning Waitangi. Nothing detected as containing more than 1 sentence.
1088- https://ersatzteile-fachversand.de - German misdetected as Maori.
1089- https://laskar02cinta.page.tl/Info.htm - seems like a junk site with a random sentence autotranslated into many different languages. So one sentence possibly in Maori, but may not make sense.
1090- http://www.behlig.de - misdetection. Photos from Hawaii.
1091!! http://www.udhr.de - Universal Declaration of Human Rights. (Also on a Bulgarian site). Multiple translations available.
1092- ITALY:
1093 http://oipaz.net/IMG/GalleriaAotearoa/ - NZ photogallery with each photo captioned by placename
1094 http://www.marcosanti.it/Reportage/Oceania_ph/Nuova_Zelanda/ - each photo captioned by NZ placename
1095 http://www.pegasoesmicamion.com/ - REO abbreviation misidentified, also in REO%20PUBLICIDAD.htm
1096- AUSTRIA:
1097 petit-prince.at - Tahitian and Wayuu (Venezuela) translations of Le Petit Prince
1098 http://www.tmtmm.net/newzealand - photos from NZ named after places and people's names
1099- ROMANIA: parohiauceadesus.ro - Sentences of single Romanian words misidentified.
1100- ISRAEL:
1101 http://www.daat.ac.il - misidentification of "no." as MRI, and Hebrew words.
 https://www.hitiaotera.com/ - misidentification of Tahitian pages
1103- RUSSIA: https://www.gismeteo.lv - misidentification of an email address
1104- JAPAN: http://yutaka.it-n.jp - many pages of scientific names of (plants?) which are often misdetected as MRI
1105!! - IRELAND, IE: https://coggle.it
1106- IRAN: https://www.dideo.ir/v/yt/d6cgya0ze-E - video title from MaoriTelevision website
1107- CZECH republic:
1108? https://www.fipojobs.com/new-zealand/jobs-work-p-1 - NZ job position title in MRI but rest in English
1109!! http://www.henryklahola.nazory.cz/094.Maori.htm and http://henryklahola.nazory.cz variant
1110 http://about.ilikeyou.com - dating site. Misidentification.
1111- SPAIN:
1112!! https://www.uv.es/~pla/red.net/intmaori.html
1113 https://www.reclamaciondevuelos.com - 2 occurrences of the word "kiwi"
1114 http://www.info-hoteles.com/nz/2/hotels_lake_rotoiti.asp - 2 uses of the same placename
1115 http://www.cruceros-princess.mx/princessMX/Oferta_Cruzeiros_Polinesia.html - Polynesian placenames
1116- SINGAPORE: https://omg-solutions.com - autotranslated
1117- TURKEY: https://www.elitedeluxe.com.tr/mi/yatak-odasi-takimlari - autotranslated
1118- MEXICO: http://www.gelbukh.com - misidentification, lines of just numbers or phrases like "Area Chair" in English and Russian CVs.
1119- FINLAND: http://pertti.com - travelogue, placenames
1120- SWITZERLAND CH:
1121 nicoledidi.ch - blog, placenames
1122 https://photos.axelebert.org - Tahiti related content
1123- UNKNOWN: https://www.viveipcl.com: tours website, placenames mentioned
1124#- EU: https://www.the-good-stuff-factory.be/mi/ : Autotranslated
1125!! - BULGARIA: http://anitra.net/activism/humanrights/UDHR/rrt_print.htm (2 pages)
1126
1127
1128TREATING AUSTRALIA AND GREAT BRITAIN MORE SPECIALLY (don't ignore /mi in URL, same as with NZ, but do leave out .nz TLDs since we cover them under NZ - TODO: later find country codes of all .nz TLDs):
1129[nothing found under "UK", only under "GB"]
1130
1131db.getCollection('Websites').find({
1132 domain: {$not: /.nz$/},
1133 numPagesContainingMRI: {$gt: 0},
1134 $or: [{geoLocationCountryCode: "AU"}, {geoLocationCountryCode: "GB"}]
1135}).count()
11
1137
1138db.Websites.aggregate([
1139 {
1140 $match: {
1141 domain: {$not: /.nz$/},
1142 numPagesContainingMRI: {$gt: 0},
1143 $or: [{geoLocationCountryCode: "AU"}, {geoLocationCountryCode: "GB"}]
1144 }
1145 },
1146 { $unwind: "$geoLocationCountryCode" },
1147 {
1148 $group: {
1149 _id: {$toLower: '$geoLocationCountryCode'},
1150 count: { $sum: 1 },
1151 domain: { $addToSet: '$domain' }
1152 }
1153 },
1154 { $sort : { count : -1} }
1155]);
1156
1157AUSTRALIA:
1158!! https://www.kiwiproperty.com - e.g. https://www.kiwiproperty.com/the-base/mi/he-paepaki/ has some actual MRI sentences. [Not autotranslated]
1159? http://fionajack.net - Wellington gallery of artist. A few occurrences of Kia Ora in a title like context (e.g. "Street Party Kia Ora! Kia Ora!")
1160X!! https://infogram.com/te-marautanga-o-aotearoa-moe-pld-allocations-2012-1go502ygvn562jd - site of individual pages (like docs.google.com). This one has a relevant infogram image. But it's English with MRI in the image legend and captions.
1161!! https://koreromaori.com - some actual Maori language sentences
1162 http://theunderwaterworld.com/Galleries/Roimata/roimata-frame.html - placenames
1163
1164UK:
1165 http://www.wordsearchfun.com/200628_Word_Find_wordsearch.html - 2 word games with Maori words (one of them has 3 different views, e.g. print view)
1166? https://omniatlas.com/maps/australasia/18400206/plain/ - historical map with Maori iwi names over NZ map regions
1167? https://omniatlas.com/maps/australasia/18400206/ - historical map of Australia and NZ at the time of the Treaty of Waitangi, with events marked in English
1168 https://centrallanguageschool.com - AUTOTRANSLATED
1169 https://www.solasolv.com - Autotranslated product site
1170 http://mikestephens.co.uk/ - photo captions containing NZ placenames
1171 http://www.woolrych.org/nzholiday2004/ - photogallery captioned with NZ placenames
1172
1173--------------
1174
1175GETTING TABLE DATA OUT OF MONGO DB:
1176
1177https://stackoverflow.com/questions/28733692/how-to-export-json-from-mongodb-using-robomongo
1178"export to file" as in a spreadsheet like to a .csv?
1179
1180IMO this is the EASIEST way to do this in Robo 3T (formerly robomongo):
1181
1182 1. In the top right of the Robo 3T GUI there is a "View Results in text mode" button, click it and copy everything
1183
1184 2. paste everything into this website: https://json-csv.com/
1185
1186 3. click the download button and now you have it in a spreadsheet.
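
If pasting results into a third-party website is not desirable, the conversion can also be done locally. A minimal sketch in plain JavaScript, assuming the copied output has first been turned into an array of { _id, count, domain } objects like those produced by the aggregate queries above (Robo 3T's text mode output would need its /* 1 */ style comments removed to get there). toCsv is a made-up helper name, and the sample row reuses one of the earlier results.

```javascript
// Hypothetical helper: one CSV row per (country, domain) pair.
function toCsv(groups) {
  // Quote every value and double any embedded quotes, per usual CSV practice.
  const quote = value => '"' + String(value).replace(/"/g, '""') + '"';
  const lines = ["country,count,domain"];
  for (const group of groups) {
    for (const domain of group.domain) {
      lines.push([group._id, group.count, domain].map(quote).join(","));
    }
  }
  return lines.join("\n");
}

// Sample input in the shape of the aggregate results shown earlier.
const sampleGroups = [
  { _id: "ru", count: 1, domain: ["http://www.treningmozga.com"] },
  { _id: "de", count: 1, domain: ["http://www.almancax.com"] }
];

const csv = toCsv(sampleGroups);
console.log(csv);
```

The resulting text can be saved with a .csv extension and opened in a spreadsheet.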
1187
1188
1189https://json-csv.com/
1190
1191
1192---------------------
1193
Count of websites that have at least 1 page containing at least one sentence detected as MRI
AND which have mi in the URL path:

db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}]}).count()

491
1200
1201
1202
1203# The websites that have some MRI detected AND which are either in NZ or with NZ TLD
1204# or (so if they're from overseas) don't contain /mi or mi.* in URL path:
1205
1206db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{$or: [{geoLocationCountryCode: "NZ"}, {domain: /\.nz$/}, {urlContainsLangCodeInPath: false}]}]}).count()
396
1208
1209Include Australia (to get the valid "kiwiproperty.com" website included in the result list):
1210
1211db.getCollection('Websites').find({$and: [
1212 {numPagesContainingMRI: {$gt: 0}},
1213 {$or: [{geoLocationCountryCode: /(NZ|AU)/}, {domain: /\.nz$/}, {urlContainsLangCodeInPath: false}]}
1214 ]}).count()
1215
397
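
A side note on the /(NZ|AU)/ pattern: it is unanchored, so it would also match any longer value that merely contains NZ or AU. For plain two-letter country codes that makes no difference, but an anchored pattern (or MongoDB's $in operator with ["NZ", "AU"]) states the intent more exactly. The sketch below is plain JavaScript; "NZL" and "AUT" are hypothetical three-letter values used only to show the difference, and it is an assumption that no such values occur in geoLocationCountryCode.

```javascript
const unanchoredCodes = /(NZ|AU)/;
const anchoredCodes = /^(NZ|AU)$/;
const wanted = ["NZ", "AU"]; // what {$in: ["NZ", "AU"]} would accept

// "NZL" and "AUT" are hypothetical values, not seen in the data.
const codes = ["NZ", "AU", "US", "GB", "NZL", "AUT"];

const comparison = codes.map(code => ({
  code,
  unanchored: unanchoredCodes.test(code),
  anchored: anchoredCodes.test(code),
  inList: wanted.includes(code)
}));

console.log(comparison);
```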
1217
1218# aggregate results by a count of country codes
1219db.Websites.aggregate([
1220 {
1221 $match: {
1222 $and: [
1223 {numPagesContainingMRI: {$gt: 0}},
1224 {$or: [{geoLocationCountryCode: /(NZ|AU)/}, {domain: /\.nz$/}, {urlContainsLangCodeInPath: false}]}
1225 ]
1226 }
1227 },
1228 { $unwind: "$geoLocationCountryCode" },
1229 {
1230 $group: {
1231 _id: {$toLower: '$geoLocationCountryCode'},
1232 count: { $sum: 1 }
1233 }
1234 },
1235 { $sort : { count : -1} }
1236]);
1237
1238
# Just considering those sites outside NZ and without a .nz TLD:
1240db.Websites.aggregate([
1241 {
1242 $match: {
1243 $and: [
1244 {geoLocationCountryCode: {$ne: "NZ"}},
1245 {domain: {$not: /\.nz/}},
1246 {numPagesContainingMRI: {$gt: 0}},
1247 {$or: [{geoLocationCountryCode: "AU"}, {urlContainsLangCodeInPath: false}]}
1248 ]
1249 }
1250 },
1251 { $unwind: "$geoLocationCountryCode" },
1252 {
1253 $group: {
1254 _id: {$toLower: '$geoLocationCountryCode'},
1255 count: { $sum: 1 },
1256 domain: { $addToSet: '$domain' }
1257 }
1258 },
1259 { $sort : { count : -1} }
1260]);
1261
1262
# total count of the above, excluding NZ related sites
1264db.getCollection('Websites').find({$and: [
1265 {geoLocationCountryCode: {$ne: "NZ"}},
1266 {domain: {$not: /\.nz/}},
1267 {numPagesContainingMRI: {$gt: 0}},
1268 {$or: [{geoLocationCountryCode: "AU"}, {urlContainsLangCodeInPath: false}]}
1269 ]}).count()
1270
221 websites
1272
1273
1274# But to produce the tentative non-product sites, we also want the aggregate for all NZ sites (from NZ or with .nz tld):
1275db.getCollection('Websites').find({$and: [
1276 {numPagesContainingMRI: {$gt: 0}},
1277 {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]}
1278 ]}).count()
1279
176
1281
1282(Total is 221+176 = 397, which adds up).
1283
1284# Get the count (and domain listing) output put under a hardcoded _id of "nz":
1285db.Websites.aggregate([
1286 {
1287 $match: {
1288 $and: [
1289 {numPagesContainingMRI: {$gt: 0}},
1290 {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]}
1291 ]
1292 }
1293 },
1294 { $unwind: "$geoLocationCountryCode" },
1295 {
1296 $group: {
1297 _id: "nz",
1298 count: { $sum: 1 },
1299 domain: { $addToSet: '$domain' }
1300 }
1301 },
1302 { $sort : { count : -1} }
1303]);
1304
1305
1306-----------------------
1307US:
1308Done: manually inspected 68/117 sites
1309
1310TOTAL US: 4+7+7+4+3=25
1311
1312DEFINITELY:
1313+ http://anglicanhistory.org,
1314+ http://www.unicode.org, [Universal declaration of Human Rights]
1315+ https://static-promote.weebly.com,
1316+ http://aclhokiangarocks.blogspot.com, [often English, but COMMUNITY. At least short or partial MRI sentences.]
1317
1318BIBLE/MOHAMMED/BAHAI TRANSLATIONS probably not auto translations:
1319+ http://bahaiprayers.net, [Dutch seems to be properly translated, not auto-translated, so maybe MRI too]
1320+ https://biblehub.com,
1321+ http://www.muhammad.com, [possibly not autotranslated]
1322+ http://www.godrules.net, [possibly not autotranslated]
1323+ http://m.biblepub.com,
1324+ http://www.krassotkin.ru, [probably real translations, as there are multiple Dutch translations from different sources provided]
1325+ http://www.gotquestions.org, [doesn't appear autotranslated]
1326X https://ebible.org, [Hiri Motu, PNG language misdetected. Doesn't seem to have Maori]
1327X https://www.bible.com, doesn't have Maori translation. Misdetected.
1328X https://wol.jw.org, - doesn't have Maori translations. Instead, Rongo-rongo, Kiribati (Micronesian) etc misdetected
1329X https://png.bible, [misdetected, Papua New Guinea]
1330X http://www.precious-testimonies.com, http://precious-testimonies.com/JesusDidItTranslations/JesusDidItMaoriTranslation.htm may be autotranslated as the Dutch page looks more like Danish or some Scandinavian language and the French page is missing accented characters.
1331
1332CHECK, PROBABLY HAS MRI - PROCESSED:
1333!! https://maorinews.com,
1334!! http://maaori.com,
1335!!+ http://kiaorahola.blogspot.com,
1336+ https://kjohnsonnz.blogspot.com,
1337+ http://pumanawawhangara.blogspot.com,
1338+ http://dannykahei.tripod.com,
1339+ http://burkekm001.tripod.com,
1340+ http://tkkpipipaopao.blogspot.com,
1341+ http://manateina.blogspot.com,
1342? tkkpipipaopao.blogspot.com? http://rangiwewehi.com, [English, but community]
1343? https://www.terakau.org, [COMMUNITY, but English]
1344? https://www.pipirikiapapatuanuku.org, [COMMUNITY?, in English, environment site]
1345~ http://georgegi.tripod.com,
1346~ http://ngarangatahi.tripod.com, [1 page, image caption, Maori language warden position title with English sentence for appointment as warden]
1347X http://fhr.kiwicelts.com,
1348X http://tkrow.tripod.com, [English, background of NZ place]
1349X http://www.mkiwi.com, - placenames
1350X http://www.waimate.com, [English, NZ place]
1351
1352MAYBE HAS MRI, INSPECT - PROCESSED:
1353? https://www.natekore2018.com, [lots of English, but COMMUNITY, CULTURE]
1354+ http://tatai09.blogspot.com,
1355+ http://www.twttoa.com,
1356+ http://tuhua2010.blogspot.com,
1357X http://www.huapala.org, [misdetected, Hawaiian]
1358X https://www.vaihaunui.net, [misdetected, Tahiti]
1359X https://www.kaifineart.com, [art site by different artists. A Chinese and another (possibly Japanese) name were misdetected]
X http://mahoraroom8.blogspot.com, [NZ school, but main page mostly in English. No pages with > 1 sentence detected as MRI]
1361+ http://piripi.blogspot.com,
1362X http://www.hiroa.pf, [misdetected. Crawled content appears Polynesian not Maori]
1363X http://korora.econ.yale.edu, [NZ place photo caption]
1364X https://www.poehalisnami.ua, [mostly Cyrillic, with some NZ or Polynesian names misdetected]
1365X http://hannas-reiseblog.blogspot.com - one page contained NZ placenames, another had a word misdetected
1366
1367
1368+ https://www.breaker.audio, [audio, with occasional English.]
1369? https://livestream.com, [video and audio, seems in English, but maybe CULTURAL/COMMUNITY?]
1370
1371X https://docs.google.com, timetable with occasional Maori language word
1372+ https://drive.google.com, https://drive.google.com/file/d/1NwuzafjddaP8gxI7O_Zapts5bM7mrtwn/preview is an image of Maori number names. But other page on drive.google.com is a NZ certificate or ID (in English) of a person's position.
1373~+ http://ritusehji.blogspot.com - no page with more than 1 sentence detected. But short string of actual MRI content. Educator blog with pictures and English language content.
1374
1375
1376PINTEREST
1377+ https://in.pinterest.com/pin/317363104978423418/
1378 "karakia mo te moana - Google Search | Te Reo Maori Resources | Moana, Powerpoint tips, Google"
1379? https://za.pinterest.com/pin/524669425310419500/
1380 Maori Moko | Image | Moko Maori Tattoo & Portraits | TA MOKO | Maori tribe, Maori people, Maori art [COMMUNITY, CULTURE]
1381[The other pinterest detected as numPagesContainingMRI > 0 was misdetected]
1382
1383https://nl.pinterest.com,
1384https://www.pinterest.jp,
1385https://www.pinterest.it,
1386https://www.pinterest.co.uk,
1387https://www.pinterest.ca,
1388https://za.pinterest.com,
1389https://www.pinterest.fr,
1390https://in.pinterest.com,
1391
1392MORE BLOGSPOTS
1393X http://word-dialect.blogspot.com, [Indonesian, misdetected]
1394~ http://atopeconlostopes.blogspot.com, [title on page appears to be in MRI, but content appears to be in English and South/Central American. Internationally focussed content.]
1395X http://lianzaconference2012.blogspot.com, [NZ placename or institution]
1396? http://mrshamiltonskoolkidz.blogspot.com, [te reo Maori related school activities. Described in English.]
1397X http://capsuraotearoa.blogspot.com, [blog in French, photo captions contain NZ placenames]
1398X http://blogdepasopor.blogspot.com, [blog in French, Rapa Nui/Easter Island related content, misdetected.]
1399
1400
1401UNLIKELY
1402?? http://naturalfatburner.net, http://naturalfatburner.net/NoNonsenseTed/fatloss-mao/ feels like it's autotranslated, an image of text appears, but the text is in MRI [advertising for some weight loss gimmick]
1403
1404
1405BLACKLIST:
1406X http://ww25.milfsplease.com,
1407X http://www.the-naked.com
1408
1409OTHER:
1410X http://seapixonline.com, https://www.seapixonline.com, [photo captions of ships. Sometimes misdetected Japanese words as MRI.]
1411X http://www.code-postal.com, https://www.code-postal.com, [not more than 1 sentence detected as in MRI]
1412X https://www.dbnames.net, [Name database, lots misdetected]
1413
1414STILL TO DO LIST - PROCESSED:
1415
1416X https://www.myadsclassified.com, [misdetected 3 short English sentences as MRI]
1417X http://www.whoisthatr.com, [misdetected short English sentence as MRI]
1418X https://www.oemsec.com, [autotranslated product site]
1419X http://svenskadress.net, [linkfarm like site of related junk links, contained URLs misdetected as MRI]
1420
1421X https://www.webwiki.com, [contains URLs. URLs containing Aotearoa as substring detected as MRI. But no proper sentence content. ]
1422X http://mikebonnice.com, [Hawaiian and Tahiti related content misdetected]
1423X http://www.hudl.com, [misdetected short English sentence as MRI]
1424X http://www.wikitree.com, [misdetected short English sentence as MRI]
1425X http://shuttersportnelson.photoshelter.com, [image captions of "Wairua Warrior"]
1426
1427X http://niken8media.logdown.com, [Poker website? Looks autotranslated or Lorem Ipsum type of meaningless sentences.]
1428X https://www.podrozeady.com, Looks Polish or other East-European language. The NZ page https://www.podrozeady.com/NZ/4/ had placenames detected.
1429
1430X http://www.thesalmons.org, [detection and misdetection of author names of papers hosted]
1431
1432X http://linkvip.top, [.rar and media file links misdetected as MRI]
1433
1434
1435X http://www.lunar-occultations.com, [NZ place names for astronomical phenomena]
1436X http://shangrilapress.net, [NZ placenames]
1437X http://malecek.com, [misdetection CD title]
1438X https://www.blue-frontiers.com, [Tahitian, Reo Tahiti, misdetected as MRI]
1439X http://www.whoisentry.com, [URL names, looked at several which were probably misdetected as MRI]
1440X http://loquevendra318.com, [uses Google translate for auto-translation]
1441
1442
1443?? http://www.forensicfashion.com, [historical information, useful for CULTURE? e.g. http://www.forensicfashion.com/1807MaoriChief.html]
1444
1445X http://www.eyecontactsite.com, [Lots of names. And a few short sentences or words possibly in comments.]
1446X http://eartheum.com, [Rapa Nui, Easter Island related content. Misdetected]
1447X http://www.steve-wheeler.co.uk, [Blogspot. Title of a single page is in Maori. "Aotearoa ... kei te aroha au ki a koe"]
1448X https://chromium.googlesource.com, [some source code related to languages' two letter codes]
1449
1450X http://www.roadsmile.com, [Lots of misdetection based on word Kia.]
1451?? https://www.knowatom.com, https://phet.colorado.edu [Similar looking science web sites for children. Uses auto-translation?]
1452
1453X https://www.indexmundi.com, [place names. Pages about Solomon Islands. Misdetection of placenames.]
1454
1455
1456
1457X http://wowwars.net, [Has a page on Kia Kaha meaning, but URL redirects to a different low quality site with bad formatting and adverts. ]
1458?? https://www.hidroponia.org.mx, [Not sure if https://www.hidroponia.org.mx/index.php/idiomas/284-hydroponics-te-ahurea-wai-maori is autotranslated or not. Can't easily locate existence of Dutch or German translated pages. There's Tamil-Singapore, but no other Tamil. So maybe translations based on target buyer audience?]
1459X http://www.v3whois.com, [URLs are misdetected as MRI]
X http://rhymebrain.com, [appears to have misdetected a short phrase of 2 words, Kai Kaia, besides phrase words from other languages]
1461
1462
1463X SINGLE SENTENCE DETECTED (NO MORE AND NOT PAGE:)
1464 http://frontrowphotos.com,
1465 http://www.pressreader.com,
1466 https://www.nccri.ie,
1467 http://takethatvacation.com,
1468 http://worldradiomap.com,
1469 http://www.namesdir.com,
1470
1471 X http://www.frogsonline.com, [NZ hotels, placenames]
1472 X http://www.geni.com, [Single sentence misdetection]
1473 X http://wikiedit.org, [just a list of lots of words, possibly placenames. Some misdetected, e.g. Rapa Nui]
1474
1475
1476
1477---------------
All sites where containsMRI=true were manually inspected, except those geolocated in NZ or with a .nz TLD. This includes overseas sites with mi in the URL path. All NZ sites were passed through without inspection.
1479
MANUAL - TOTAL NUM SITES WITH SOME MRI CONTENT BY COUNTRY
NZ: 176
US: 25
AU: 3
FR: 1
DK: 2
(CA: 0.5)
DE: 2
IE (Ireland): 1
CZ: 1
ES: 1
BG: 1
1492
TIDIED:
NZ: 176
US: 25+4 from US with mi in URL path = 29
AU: 2
DE: 2
DK: 2
BG: 1
CZ: 1
ES: 1
FR: 1
IE: 1
TOTAL: 212+4 from US with mi in URL path = 216
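
The TIDIED total can be re-checked by summing the per-country counts. A small JavaScript sketch (the variable names are made up for this note; only the counts come from the list above):

```javascript
// Per-country site counts copied from the TIDIED list.
const tidied = { NZ: 176, US: 25, AU: 2, DE: 2, DK: 2, BG: 1, CZ: 1, ES: 1, FR: 1, IE: 1 };
// US sites found only via mi in the URL path.
const extraUsWithMiInPath = 4;

const subtotal = Object.values(tidied).reduce((sum, n) => sum + n, 0);
const total = subtotal + extraUsWithMiInPath;
console.log(subtotal, total);
```

With the counts as listed, the subtotal comes to 212 and the total including the 4 extra US sites comes to 216.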
1505
1506
1507------------------------------
1508
1509Need to inspect all those URLs with mi in URL path (mi.* or */mi) that are not sites with nz TLD or originating in NZ:
1510
db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}, {domain: {$not: /.nz$/}}, {geoLocationCountryCode: {$ne: "NZ"}}]}).count()
472

(vs:
db.getCollection('Websites').find({$and: [{numPagesInMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}, {domain: {$not: /.nz$/}}, {geoLocationCountryCode: {$ne: "NZ"}}]}).count()
209)
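
One caveat about the two queries above: in /.nz$/ the dot is unescaped, so it matches any character before a trailing "nz", not only a literal dot. A minimal JavaScript sketch of the difference, using made-up domain strings (hypothetical, not taken from the database):

```javascript
// Hypothetical domain strings, for illustration only.
const samples = ["example.co.nz", "example.com", "examplenz"];

const loose = /.nz$/;   // as written in the queries: "." matches any character
const strict = /\.nz$/; // escaped: requires a literal dot before "nz"

// Return the sample strings that a given pattern matches.
function matches(re) {
  return samples.filter((d) => re.test(d));
}

const looseMatches = matches(loose);
const strictMatches = matches(strict);
console.log(looseMatches);
console.log(strictMatches);
```

Whether this changes the counts above depends on whether any crawled domain ends in "nz" without a preceding dot, which has not been checked here.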
1517
1518
db.Websites.aggregate([
    {
        $match: {
            $and: [
                {numPagesContainingMRI: {$gt: 0}},
                {urlContainsLangCodeInPath: true},
                {domain: {$not: /.nz$/}},
                {geoLocationCountryCode: {$ne: "NZ"}}
            ]
        }
    },
    {$group: {_id: "$geoLocationCountryCode", count: {$sum: 1}, domain: {$addToSet: '$domain'}}},
    {$sort: {count: -1}}
])
1528
1529
1530Of interest or possible interest:
1531US:
1532!! http://indigenousblogs.com [15/18 blogs work] - has one page in Maori (http://indigenousblogs.com/feeds/mi.xml)
1533X https://biblia.gospelprime.com.br - misdetection (containsMRI)
1534X ?https://follow3rs.com - seems dodgy and possibly auto-translated. Can't spell account, misspelled as accout
1535!! https://mi.m.wikipedia.org, https://mi.wikipedia.org
1536X https://usahello.org - autotranslated
1537X http://church-of-christ.org, http://www.church-of-christ.org - I think autotranslated, because https://church-of-christ.org/nl/ has "HET kerken van Christus", which uses the singular article instead of the plural
1538X https://www.livehoster.com
1539X http://www.americasportsfloor.com, - product store. Misdetected
1540!! http://csunplugged.org, https://www.csunplugged.org - University of Canterbury NZ and site only available in EN, MI, DE, ES, CN
1541X https://mi.lawyers.cafe - autotranslated
1542 X https://mi.centr-zashity.ru - same as lawyers.cafe above: autotranslated
1543~! https://policies.oclc.org - not completely translated. Copyright page, privacy statement and cookie statement pages appear to be in Maori. Not sure if autotranslated since other pages aren't available in MI. Dutch equivalent pages seem human translated.
1544X http://jobdescriptionsample.org - autotranslated
1545X http://mi.broadcastbeat.com - autotranslated product site
1546X http://www.samewe.net - autotranslated product site
1547X https://mi.kidspicturedictionary.com - autotranslated, but MAY BE USEFUL
1548X https://www.rikoooo.com - autotranslated
1549
1550CN: -
1551
1552FR:
1553? https://mi.phcoker.com - product site "Shangke Chemical Rapu + 86 (1812) 4514114 [email protected]"
1554X http://www.gpedia.com - dodgy copy of wikipedia, see http://www.gpedia.com/nl/gpedia/Hoofdpagina
1555
1556NL:
1557X http://www.martinvrijland.nl - wordpress, autotranslated
1558
1559CA:
1560X https://www.wikiplanet.click (seems like a dodgy copy of wikipedia)
1561X cloudsfeed.com - wordpress admin page
1562
1563
db.getCollection('Webpages').find({$and: [{isMRI: true}, {URL: /indigenousblogs\.com/}]})
=> http://indigenousblogs.com/mi/
1566
1567--------------------------
1568
1569
db.Websites.aggregate([
    {
        $match: {
            $and: [
                {geoLocationCountryCode: {$ne: "NZ"}},
                {domain: {$not: /\.nz/}},
                {numPagesContainingMRI: {$gt: 0}},
                {$or: [{geoLocationCountryCode: "AU"}, {urlContainsLangCodeInPath: false}]}
            ]
        }
    },
    { $unwind: "$geoLocationCountryCode" },
    {
        $group: {
            _id: {$toLower: '$geoLocationCountryCode'},
            count: { $sum: 1 },
            domain: { $addToSet: '$domain' },
            numPagesInMRI: { $addToSet: '$numPagesInMRI' },
            numPagesContainingMRI: { $addToSet: '$numPagesContainingMRI' },
            numPagesInMRICount: { $sum: '$numPagesInMRI' },
            numPagesContainingMRICount: { $sum: '$numPagesContainingMRI' }
        }
    },
    { $sort : { count : -1} }
]);
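
For reference, a plain JavaScript sketch of roughly what the $match, $group and $sort stages above compute, run on a few made-up records. The domains and numbers below are hypothetical, not data from the Websites collection, and this only illustrates the logic; it is not a replacement for the MongoDB pipeline:

```javascript
// Hypothetical records shaped like documents in the Websites collection.
const websites = [
  { domain: "a.example.com", geoLocationCountryCode: "US", numPagesInMRI: 2, numPagesContainingMRI: 5, urlContainsLangCodeInPath: false },
  { domain: "b.example.com", geoLocationCountryCode: "US", numPagesInMRI: 0, numPagesContainingMRI: 1, urlContainsLangCodeInPath: false },
  { domain: "c.example.com", geoLocationCountryCode: "AU", numPagesInMRI: 1, numPagesContainingMRI: 1, urlContainsLangCodeInPath: true },
  { domain: "d.example.com", geoLocationCountryCode: "US", numPagesInMRI: 3, numPagesContainingMRI: 3, urlContainsLangCodeInPath: true },
  { domain: "e.example.nz", geoLocationCountryCode: "US", numPagesInMRI: 4, numPagesContainingMRI: 4, urlContainsLangCodeInPath: false },
  { domain: "f.example.com", geoLocationCountryCode: "NZ", numPagesInMRI: 9, numPagesContainingMRI: 9, urlContainsLangCodeInPath: false },
];

// Same conditions as the $match stage.
const matched = websites.filter((w) =>
  w.geoLocationCountryCode !== "NZ" &&
  !/\.nz/.test(w.domain) &&
  w.numPagesContainingMRI > 0 &&
  (w.geoLocationCountryCode === "AU" || w.urlContainsLangCodeInPath === false)
);

// Equivalent of $group keyed on the lower-cased country code.
const groups = new Map();
for (const w of matched) {
  const key = w.geoLocationCountryCode.toLowerCase();
  if (!groups.has(key)) {
    groups.set(key, { _id: key, count: 0, domain: new Set(), numPagesInMRICount: 0, numPagesContainingMRICount: 0 });
  }
  const g = groups.get(key);
  g.count += 1;                                   // $sum: 1
  g.domain.add(w.domain);                         // $addToSet
  g.numPagesInMRICount += w.numPagesInMRI;        // $sum of a field
  g.numPagesContainingMRICount += w.numPagesContainingMRI;
}

// Equivalent of $sort: { count: -1 }.
const result = [...groups.values()].sort((x, y) => y.count - x.count);
console.log(result);
```

In this sample, the AU record is kept even though it has the language code in its URL path, because of the $or clause; the US record with the language code in its path is dropped.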
1595
1596
To convert json to csv
In gedit replace
\/\*\s*\d+\s*\*\/ => ,
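
The same replacement can be scripted. A minimal JavaScript sketch, assuming the query output is otherwise strict JSON (no ObjectId(...) or ISODate(...) wrappers) and that each record is preceded by a /* n */ marker. The function names and the sample text are made up for illustration:

```javascript
// Turn marker-separated output ("/* 1 */ {...} /* 2 */ {...}") into a JSON array.
function shellOutputToArray(text) {
  const joined = text
    .replace(/\/\*\s*\d+\s*\*\//g, ",") // same pattern as the gedit replace
    .trim()
    .replace(/^,/, "");                 // drop the comma left by the first marker
  return JSON.parse("[" + joined + "]");
}

// Write a header line plus one quoted CSV line per record for the chosen fields.
function toCsv(records, fields) {
  const quote = (v) => '"' + String(v).replace(/"/g, '""') + '"';
  const lines = [fields.join(",")];
  for (const r of records) {
    lines.push(fields.map((f) => quote(r[f])).join(","));
  }
  return lines.join("\n");
}

// Made-up sample output, not real results.
const sample = '/* 1 */\n{ "_id": "us", "count": 2 }\n\n/* 2 */\n{ "_id": "au", "count": 1 }\n';
const records = shellOutputToArray(sample);
const csv = toCsv(records, ["_id", "count"]);
console.log(csv);
```

Array-valued fields such as the domain set would still need flattening before they fit in a single CSV cell; that step is left out of this sketch.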
1600
1601----------
1602
1603https://www.techdirt.com/articles/20160413/12012834171/how-bad-are-geolocation-tools-really-really-bad.shtml
1604https://stackoverflow.com/questions/28740077/how-to-find-historical-geolocation-for-an-ip-address-perhaps-using-maxmind
1605https://serverfault.com/questions/59167/how-often-do-ip-blocks-get-reassigned-to-different-regions
1606
GEDIT: Regex find and replace at start
 "https?\:\/\/(www.)?
^[^"]*"https?\:\/\/(www.)?

and at end
 ",
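
As a scripted alternative to the gedit step, the sketch below strips the scheme and an optional leading www. from URL strings. It assumes the aim of the find and replace is to reduce the URLs to bare domains, which is my reading of the patterns above and not stated explicitly. The function name is hypothetical:

```javascript
// Strip the scheme and an optional leading "www." from a URL string.
// Note the dot after www is escaped here, unlike in the gedit pattern.
function bareDomain(url) {
  return url.replace(/^https?:\/\/(www\.)?/, "");
}

// Three URLs taken from the US list in these notes.
const urls = ["http://www.unicode.org", "https://biblehub.com", "http://maaori.com"];
const domains = urls.map(bareDomain);
console.log(domains);
```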
1613
1614-----------------------
1615GEOLOCATION CHANGES AFTER REINGESTING UPON INTRODUCING ANGLICAN.ORG:
1616-----------------------
1617NZ the same as before
1618 NL, DE, FR, DK, ES, GB same
1619 IT, AT, RO, CH, RU, BG, MX, JP, CN, IE, IR, FI same
1620
1621US gained 3 + 1 from mi in URL path:
1622+ anglican.org (NEW)
1623X articles.imperialtometric.com (from CA)
1624X daandehn.com (from CA)
1625+ kiwiproperty.com (from AU)
1626
1627CA lost 2:
1628X articles.imperialtometric.com (to US)
1629X daandehn.com (to US)
1630
1631AU:
1632! lost kiwiproperty.com (to US - mi in URL path version file!)
1633
1634
1635CZ:
1636X gained viveipcl.com (from UNKNOWN)
1637
1638UNKNOWN:
1639X gained hitiaotera.com from IL
1640
1641IL:
1642X lost one (hitiaotera.com to UNKNOWN)
1643
1644
1645FINAL SITE COUNT (contain >= 1 page with >= 1 MRI sentence)
1646
1647DK:
1648http://ngapuhiradio.com
1649http://ngapuhitelevision.com
1650 [http://akona.ngapuhitelevision.com
1651 http://waiatarangatiratanga.ngapuhitelevision.com
1652 http://jazz.ngapuhitelevision.com
1653 http://powhiri.ngapuhitelevision.com
1654 http://komisch.ngapuhitelevision.com]
1655
1656DE
1657http://www.udhr.de
1658https://www.cartogiraffe.com/
1659
1660AU
1661https://koreromaori.com
1662(https://infogram.com/)
1663
1664FR
1665http://chantsdeluttes.free.fr/
1666
1667ES
1668https://www.uv.es/
1669
1670IE
1671https://coggle.it
1672
1673CZ:
1674http://www.henryklahola.nazory.cz
1675
1676BG:
1677http://anitra.net/
1678
1679US finals:
1680http://anglican.org
1681http://anglicanhistory.org
1682http://www.unicode.org
1683https://static-promote.weebly.com
1684http://aclhokiangarocks.blogspot.com
1685http://bahaiprayers.net
1686https://biblehub.com
1687http://www.muhammad.com
1688http://www.godrules.net
1689http://m.biblepub.com
1690http://www.krassotkin.ru
1691http://www.gotquestions.org
1692https://maorinews.com
1693http://maaori.com
1694http://kiaorahola.blogspot.com
1695https://kjohnsonnz.blogspot.com
1696http://pumanawawhangara.blogspot.com
1697http://dannykahei.tripod.com
1698http://burkekm001.tripod.com
1699http://tkkpipipaopao.blogspot.com
1700http://manateina.blogspot.com
1701http://tatai09.blogspot.com
1702http://www.twttoa.com
1703http://tuhua2010.blogspot.com
1704http://piripi.blogspot.com
1705https://www.breaker.audio
1706https://drive.google.com
1707http://ritusehji.blogspot.com
1708https://in.pinterest.com
1709
171029
1711
1712https://www.kiwiproperty.com
1713http://indigenousblogs.com
1714https://mi.m.wikipedia.org, https://mi.wikipedia.org
1715http://csunplugged.org, https://www.csunplugged.org
1716(https://policies.oclc.org)
1717
171834 incl with MI in URL Path
1719
1720
1721---------------------
1722NZ:
1723 http://www.teipukarea.maori.nz
1724 http://ngatipahauwera.co.nz
1725 http://www.oag.govt.nz
1726 https://sexualviolence.victimsinfo.govt.nz
1727 http://tmoa.tki.org.nz
1728 http://www.tewhanake.maori.nz
1729 http://www.matarikifestival.org.nz
1730 http://www.otepoti.school.nz
1731 https://www.maoritelevision.com
1732 http://pukapuka.nz
1733 http://community.nzdl.org
1734 http://maori.livingheritage.org.nz [http://www.livingheritage.org.nz]
1735 http://pukoro.co.nz
1736 https://cdn.tehiku.nz [DOMAIN: tehiku.nz]
1737 http://www.runanga.co.nz
1738 http://kuraaiwi.maori.nz
1739 http://kurataiao.tki.org.nz
1740 http://satellites.co.nz
1741 http://teaohou.natlib.govt.nz
1742 http://www.tuwharetoa.iwi.nz
1743 https://www.terito.school.nz
1744 https://ttw1.cwp.govt.nz
1745 https://www.whanau-tahi.school.nz
1746 https://e-ako-pangarau.nzmaths.co.nz
1747 https://teaomaori.news
1748 http://tetaurawhiri.govt.nz
1749 https://www.tuiatematangi.ac.nz
1750 http://animations.tewhanake.maori.nz
1751 https://www.dnc.org.nz
1752 http://firstworldwar.tki.org.nz [http://www.firstworldwar.tki.org.nz]
1753 http://www.28maoribattalion.org.nz
1754 http://www.tewikiotereomaori.co.nz
1755 http://www.brettgraham.co.nz
1756 https://hepatakakupu.nz
1757 http://anglicanprayerbook.nz
1758 http://arataua.nz
1759 http://maori.tki.org.nz
1760 https://paekupu.co.nz
1761 https://haereheikaiako.co.nz
1762 https://curriculumtool.education.govt.nz
1763 http://kurakokiri.maori.nz [includes: http://www.kurakokiri.maori.nz]
1764 http://www.kkmmaungarongo.co.nz
1765 http://www.heartland.co.nz
1766 http://oilcrash.com
1767 http://www.kura-porirua.school.nz
1768 https://www.sporty.co.nz
1769 https://www.tematawai.maori.nz
1770 https://www.terakipaewhenua.school.nz
1771 http://www.tetaurawhiri.govt.nz
1772 http://archive.stats.govt.nz
1773 http://tiritiowaitangi.govt.nz
1774 http://www.waiata.maori.nz [includes: http://waiata.maori.nz]
1775 http://hana.co.nz
1776 http://kaupare.co.nz
1777 http://www.tereowrap.nz
1778 http://www.hrc.co.nz
1779 http://ngatiporoukiponeke.org.nz
1780 http://rurued.school.nz
1781 http://www.twtop.school.nz
1782 http://www.huri-translations.pf
1783 https://teara.govt.nz/ [https://admin.teara.govt.nz, http://blog.teara.govt.nz]
1784 https://tiritiowaitangi.govt.nz
1785 http://www.tmoa.tki.org.nz
1786 https://www.komako.org.nz
1787 http://www.wcl.govt.nz [included: http://kete.wcl.govt.nz]
1788 http://punareo.co.nz
1789 https://rapuatearatika.education.govt.nz
1790 http://tmmkkm.school.nz
1791 http://www.cs.waikato.ac.nz
1792 http://www.kupengahao.co.nz
1793 https://www.hapuhauora.health.nz
1794 http://cms.sunsmartschools.co.nz [http://sunsmartschools.co.nz/]
1795 http://kuraproductions.co.nz
1796 https://keepourmoneyclean.govt.nz
1797 http://www.tekura.school.nz
1798 http://www.tkkmmokopuna.school.nz
1799 http://hangaraumatihiko.tki.org.nz
1800 http://www.pakanae.maori.nz
1801
1802
1803 http://holyspirit.nz
1804 https://www.ngamanawainc.co.nz, [includes http://www.ngamanawainc.co.nz]
1805 http://www.finlaysonpark.school.nz
1806 http://www.w3vietnam.org.nz [includes http://w3vietnam.org.nz]
1807 https://www.takitimu.ac.nz
1808 https://kotahimiriona.co.nz
1809 https://rehuamarae.co.nz
1810 http://reoora.co.nz
1811
1812 https://manawatuheritage.pncc.govt.nz
1813 http://rsnz.natlib.govt.nz
1814 https://www.taitokerautrust.org.nz
1815 http://tewikiotereomaori.nz
1816 https://www.korokikahukura.co.nz
1817 https://www.pinterest.nz
1818 https://www.rereahu.maori.nz
1819 http://givealittle.co.nz
1820 https://kaiiwicamp.nz [includes http://kaiiwicamp.nz]
1821 http://ngarauhuia.ngatiapakiterato.iwi.nz
1822 https://m.wairarapatv.co.nz
1823
1824 http://avonside.net
1825 http://www.maoriinvestments.co.nz
1826 http://conference.tpwt.maori.nz
1827 https://www.puau.school.nz
1828 http://tehauora.org.nz
1829
1830 http://temahurehure.maori.nz
1831 http://www.temarareo.org
1832 http://www.tetaumuturunanga.iwi.nz
1833 http://www.writersfestival.co.nz
1834 http://www.kmk.maori.nz
1835 https://www.stats.govt.nz [includes http://archive.stats.govt.nz]
1836
1837+? http://ngatiwhakaue.iwi.nz
1838+? https://interactives.stuff.co.nz
1839+? http://whatonga.school.nz
1840+? https://player.vimeo.com
1841+? http://southerntribes.co.nz
1842
1843?X https://www.e-agent.nz [includes: https://office.e-agent.nz, http://videos.e-agent.nz]