source: other-projects/maori-lang-detection/MoreReading/mongodb.txt@ 33823

Last change on this file since 33823 was 33823, checked in by ak19, 14 months ago

Recommitting mongo-data folder with renamed files with numbering.

1MongoDB
2Installation:
3 https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/
4 https://docs.mongodb.com/manual/administration/install-on-linux/
5 https://hevodata.com/blog/install-mongodb-on-ubuntu/
6 https://www.digitalocean.com/community/tutorials/how-to-install-mongodb-on-ubuntu-16-04
7 CENTOS (Analytics): https://tecadmin.net/install-mongodb-on-centos/
8 FROM SOURCE: https://github.com/mongodb/mongo/wiki/Build-Mongodb-From-Source
9GUI:
10 https://robomongo.org/
11 Robomongo is Robo 3T now
12
13https://www.tutorialspoint.com/mongodb/mongodb_java.htm
14JAR FILE:
15 http://central.maven.org/maven2/org/mongodb/mongo-java-driver/
16 https://mongodb.github.io/mongo-java-driver/
17
18
19
20https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/
21http://www.programmersought.com/article/6500308940/
22
23 52 sudo apt-get install mongodb-clients
24 53 mongo 'mongodb://mongodb.cms.waikato.ac.nz:27017' -u anupama -p
25
26Failed with
27 Error: HostAndPort: host is empty at src/mongo/shell/mongo.js:148
28 exception: connect failed
29
This is due to a version incompatibility between the mongo client (shell) and the MongoDB server.
The solution is to follow the instructions at http://www.programmersought.com/article/6500308940/
and then those at https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/
as below:
34
35 54 sudo apt-get purge mongodb-clients
36 55 sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 9DA31620334BD75D9DCB49F368818C72E52529D4
37 56 echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu xenial/mongodb-org/4.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.0.list
38 57 sudo apt-get update
39 58 sudo apt-get install mongodb-clients
40 59 mongo 'mongodb://mongodb.cms.waikato.ac.nz:27017' -u anupama -p
(The connection still fails at this point.)
42 60 sudo apt-get install -y mongodb-org
The above ensures an up-to-date mongo client, but installs the MongoDB server too. Possibly this is the only step needed to install both an up-to-date mongo client and the MongoDB server.
44 72 sudo service mongod status
45
46 103 sudo service mongod start
"mongod" is the MongoDB daemon: the server process that listens for client connections.
48 104 sudo service mongod status
49 88 sudo service mongod stop
50
51
52DETAILS:
53
54wharariki:[879]/Scratch/ak19/gs3-extensions/maori-lang-detection>mongo 'mongodb://mongodb.cms.waikato.ac.nz:27017' -u anupama -p
55
This didn't work, even after entering the password. It failed with:
57
58 MongoDB shell version: 2.6.10
59 Enter password:
60 connecting to: mongodb://mongodb.cms.waikato.ac.nz:27017
61 2019-11-04T20:02:47.970+1300 Assertion: 13110:HostAndPort: host is empty
62 2019-11-04T20:02:47.970+1300 0x6b75c9 0x659e9f 0x636f69 0x4fa55c 0x501249 0x4fa7f1 0x6006fd 0x5eb869 0x7f7bfbd47d76 0x1f3c10d06362
63 mongo(_ZN5mongo15printStackTraceERSo+0x39) [0x6b75c9]
64 mongo(_ZN5mongo10logContextEPKc+0x21f) [0x659e9f]
65 mongo(_ZN5mongo11msgassertedEiPKc+0xd9) [0x636f69]
66 mongo(_ZN5mongo16ConnectionString12_fillServersENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x50c) [0x4fa55c]
67 mongo(_ZN5mongo16ConnectionStringC1ENS0_14ConnectionTypeERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES9_+0x99) [0x501249]
68 mongo(_ZN5mongo16ConnectionString5parseERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERS6_+0x201) [0x4fa7f1]
69 mongo(_ZN5mongo17mongoConsExternalEPNS_7V8ScopeERKN2v89ArgumentsE+0x11d) [0x6006fd]
70 mongo(_ZN5mongo7V8Scope10v8CallbackERKN2v89ArgumentsE+0xa9) [0x5eb869]
71 /usr/lib/libv8.so.3.14.5(+0x99d76) [0x7f7bfbd47d76]
72 [0x1f3c10d06362]
73 2019-11-04T20:02:47.971+1300 Error: HostAndPort: host is empty at src/mongo/shell/mongo.js:148
74 exception: connect failed
75
76
This is due to a version incompatibility between the mongo client (shell) and the MongoDB server.
The client version appears in the output above (2.6.10).
The server version can be found from within the mongo client shell. Starting the shell without connecting to a db:
80
81
82 wharariki:[880]/Scratch/ak19/gs3-extensions/maori-lang-detection>mongo --shell -nodb
83 MongoDB shell version: 2.6.10 <<<<<<<<<-------------------<<<< MONGO CLIENT VERSION
84 type "help" for help
85 > help
86 db.help() help on db methods
87 db.mycoll.help() help on collection methods
88 sh.help() sharding helpers
89 rs.help() replica set helpers
90 help admin administrative help
91 help connect connecting to a db help
92 help keys key shortcuts
93 help misc misc things to know
94 help mr mapreduce
95
96 show dbs show database names
97 show collections show collections in current database
98 show users show users in current database
99 show profile show most recent system.profile entries with time >= 1ms
100 show logs show the accessible logger names
101 show log [name] prints out the last segment of log in memory, 'global' is default
102 use <db_name> set current database
103 db.foo.find() list objects in collection foo
104 db.foo.find( { a : 1 } ) list objects in foo where a == 1
105 it result of the last line evaluated; use to further iterate
106 DBQuery.shellBatchSize = x set default number of items to display on shell
107 exit quit the mongo shell
108
109 > help connect
110
111 Normally one specifies the server on the mongo shell command line. Run mongo --help to see those options.
112 Additional connections may be opened:
113
114 var x = new Mongo('host[:port]');
115 var mydb = x.getDB('mydb');
116 or
117 var mydb = connect('host[:port]/mydb');
118
119 Note: the REPL prompt only auto-reports getLastError() for the shell command line connection.
120
121 Getting help on connect options:
122
123 > var x = new Mongo('mongodb.cms.waikato.ac.nz:27017');
124 > var mydb = x.getDB('anupama');
125
126 > mydb.connect.help()
127 DBCollection help
128 db.connect.find().help() - show DBCursor help
129 db.connect.count()
130 db.connect.copyTo(newColl) - duplicates collection by copying all documents to newColl; no indexes are copied.
131 db.connect.convertToCapped(maxBytes) - calls {convertToCapped:'connect', size:maxBytes}} command
132 db.connect.dataSize()
133 db.connect.distinct( key ) - e.g. db.connect.distinct( 'x' )
134 db.connect.drop() drop the collection
135 db.connect.dropIndex(index) - e.g. db.connect.dropIndex( "indexName" ) or db.connect.dropIndex( { "indexKey" : 1 } )
136 db.connect.dropIndexes()
137 db.connect.ensureIndex(keypattern[,options]) - options is an object with these possible fields: name, unique, dropDups
138 db.connect.reIndex()
139 db.connect.find([query],[fields]) - query is an optional query filter. fields is optional set of fields to return.
140 e.g. db.connect.find( {x:77} , {name:1, x:1} )
141 db.connect.find(...).count()
142 db.connect.find(...).limit(n)
143 db.connect.find(...).skip(n)
144 db.connect.find(...).sort(...)
145 db.connect.findOne([query])
146 db.connect.findAndModify( { update : ... , remove : bool [, query: {}, sort: {}, 'new': false] } )
147 db.connect.getDB() get DB object associated with collection
148 db.connect.getPlanCache() get query plan cache associated with collection
149 db.connect.getIndexes()
150 db.connect.group( { key : ..., initial: ..., reduce : ...[, cond: ...] } )
151 db.connect.insert(obj)
152 db.connect.mapReduce( mapFunction , reduceFunction , <optional params> )
153 db.connect.aggregate( [pipeline], <optional params> ) - performs an aggregation on a collection; returns a cursor
154 db.connect.remove(query)
155 db.connect.renameCollection( newName , <dropTarget> ) renames the collection.
156 db.connect.runCommand( name , <options> ) runs a db command with the given name where the first param is the collection name
157 db.connect.save(obj)
158 db.connect.stats()
159 db.connect.storageSize() - includes free space allocated to this collection
160 db.connect.totalIndexSize() - size in bytes of all the indexes
161 db.connect.totalSize() - storage allocated for all data and indexes
162 db.connect.update(query, object[, upsert_bool, multi_bool]) - instead of two flags, you can pass an object with fields: upsert, multi
163 db.connect.validate( <full> ) - SLOW
164 db.connect.getShardVersion() - only for use with sharding
165 db.connect.getShardDistribution() - prints statistics about data distribution in the cluster
166 db.connect.getSplitKeysForChunks( <maxChunkSize> ) - calculates split points over all chunks and returns splitter function
167 db.connect.getWriteConcern() - returns the write concern used for any operations on this collection, inherited from server/db if set
168 db.connect.setWriteConcern( <write concern doc> ) - sets the write concern for writes to the collection
169 db.connect.unsetWriteConcern( <write concern doc> ) - unsets the write concern for writes to the collection
170 > mydb.version()
171 4.0.13 <<<<<<<<<-------------------<<<< MONGODB SERVER VERSION
172
173(Check Mongo server version: https://stackoverflow.com/questions/38160412/how-to-find-the-exact-version-of-installed-mongodb)
174
175Finally we now know the mongodb server version: 4.0.13
176This version doesn't work with our mongo client (shell) version of 2.6.10.
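
The mismatch above can also be checked mechanically. The following is a minimal sketch in plain JavaScript (e.g. run with Node.js), assuming version strings of the dotted form seen in the transcripts. The helper name compareVersions is hypothetical and is not a mongo shell built-in. It only says which version is older, not which client/server combinations actually work together.

```javascript
// Hypothetical helper (not part of the mongo shell): compares two dotted
// version strings numerically, component by component.
// Returns a negative number if a < b, 0 if equal, a positive number if a > b.
function compareVersions(a, b) {
  const pa = a.split(".").map(Number);
  const pb = b.split(".").map(Number);
  const len = Math.max(pa.length, pb.length);
  for (let i = 0; i < len; i++) {
    const diff = (pa[i] || 0) - (pb[i] || 0);
    if (diff !== 0) return diff;
  }
  return 0;
}

// Versions taken from the transcripts above.
const clientVersion = "2.6.10"; // from the shell banner
const serverVersion = "4.0.13"; // from mydb.version()
if (compareVersions(clientVersion, serverVersion) < 0) {
  console.log("mongo client " + clientVersion + " is older than server " + serverVersion);
}
```

The comparison is numeric rather than textual, so "2.6.10" correctly counts as newer than "2.6.9".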
177
178
179DETAILS OF INSTALLING MONGO-CLIENT AND UPDATING IT, AND INSTALLING MONGODB SERVER:
180
181
182 54 sudo apt-get purge mongodb-clients
183 55 sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 9DA31620334BD75D9DCB49F368818C72E52529D4
184 56 echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu xenial/mongodb-org/4.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.0.list
185 57 sudo apt-get update
186 58 sudo apt-get install mongodb-clients
187 59 mongo 'mongodb://mongodb.cms.waikato.ac.nz:27017' -u anupama -p
188 60 sudo apt-get install -y mongodb-org
189 61 mongo 'mongodb://mongodb.cms.waikato.ac.nz:27017' -u anupama -p
190 62 sudo service apache2 status
191 63 sudo service sshd status
192 64 sudo service mongodb status
193 65 sudo service mongo status
194 66 mongod
195 67 mongod --help
196 68 mongod --help | less
197 69 mongod -f /etc/mongod.conf
198 70 sudo mongod -f /etc/mongod.conf
199 71 less /etc/mongod.conf
200 72 sudo service mongod status
201 73 sudo service mongod start
202 74 sudo service mongod status
203 75 ls -l /var/log/mongodb/mongod.log
204 76 sudo rm /var/log/mongodb/mongod.log
205 77 sudo service mongod status
206 78 sudo service mongod start
207 79 sudo service mongod status
208 80 sudo service mongod stop
209 81 ps auxww | grep mongo
210 82 sudo service mongod start
211 83 sudo service mongod status
212 84 ps auxww | grep mongo
213 85 sudo dmsg
214 86 sudo dmesg
215 87 sudo service mongod status
216 88 sudo service mongod stop
217 89 sudo service mongod start
218 90 sudo dmesg
219 91 sudo less /var/log/mongodb/mongod.log
220 92 ls /var/lib/
221 93 ls -ld /var/lib/
222 94 ls -l /var/log/mongodb/mongod.log
223 95 ls -ld /var/lib/
224 96 groups mongodb
225 97 less /etc/mongod.conf
226 98 sudo less /var/log/mongodb/mongod.log
227 99 less /etc/mongod.conf
228 100 ls -l /var/lib/mongodb/
229 101 sudo chown -R mongodb /var/lib/mongodb/
230 102 sudo chgrp -R mongodb /var/lib/mongodb/
231 103 sudo service mongod start
232 104 sudo service mongod status
233 105 history
234
235
236
MONGO DB ROBO 3T
1. Download "Double Pack" from https://robomongo.org/
2. Untar its contents. Then untar the tarball in that.
3. Run:
 wharariki:[110]~/Downloads/robo3t-1.3.1-linux-x86_64-7419c406>./bin/robo3t
242
243===================
244On analytics, vagrant node1, we've installed the mongodb server and client.
245We're able to successfully create collections on here.
246
247
248vagrant@node1:~$ mongo
249MongoDB shell version v4.2.1
250connecting to: mongodb://127.0.0.1:27017/?compressors=disabled&gssapiServiceName=mongodb
251Implicit session: session { "id" : UUID("87bb585c-4685-47f6-bf89-a93801daeb2d") }
252MongoDB server version: 4.2.1
253Server has startup warnings:
2542019-11-04T07:48:14.197+0000 I STORAGE [initandlisten]
2552019-11-04T07:48:14.198+0000 I STORAGE [initandlisten] ** WARNING: Using the XFS filesystem is strongly recommended with the WiredTiger storage engine
2562019-11-04T07:48:14.198+0000 I STORAGE [initandlisten] ** See http://dochub.mongodb.org/core/prodnotes-filesystem
2572019-11-04T07:48:14.624+0000 I CONTROL [initandlisten]
2582019-11-04T07:48:14.624+0000 I CONTROL [initandlisten] ** WARNING: Access control is not enabled for the database.
2592019-11-04T07:48:14.624+0000 I CONTROL [initandlisten] ** Read and write access to data and configuration is unrestricted.
2602019-11-04T07:48:14.624+0000 I CONTROL [initandlisten]
261---
262Enable MongoDB's free cloud-based monitoring service, which will then receive and display
263metrics about your deployment (disk utilization, CPU, operation statistics, etc).
264
265The monitoring data will be available on a MongoDB website with a unique URL accessible to you
266and anyone you share the URL with. MongoDB may use this information to make product
267improvements and to suggest MongoDB products and deployment options to you.
268
269To enable free monitoring, run the following command: db.enableFreeMonitoring()
270To permanently disable this reminder, run the following command: db.disableFreeMonitoring()
271---
272
273> show dbs
274admin 0.000GB
275config 0.000GB
276local 0.000GB
277> use db ateacrawldata
2782019-11-05T05:24:20.155+0000 E QUERY [js] Error: [db ateacrawldata] is not a valid database name :
279Mongo.prototype.getDB@src/mongo/shell/mongo.js:51:12
280getDatabase@src/mongo/shell/session.js:913:28
281DB.prototype.getSiblingDB@src/mongo/shell/db.js:22:12
282shellHelper.use@src/mongo/shell/utils.js:803:10
283shellHelper@src/mongo/shell/utils.js:790:15
284@(shellhelp2):1:1
285> db.createCollection('webpages');
286{ "ok" : 1 }
287> db.webpages.drop();
288... ^C
289
290> db.webpages.drop();
291true
292> use ateacrawldata
293switched to db ateacrawldata
294> db.createCollection('webpages');
295{ "ok" : 1 }
296> show collections
297webpages
298> db.createCollection('websites');
299{ "ok" : 1 }
300>
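
The "use db ateacrawldata" attempt in the transcript above fails because "use" takes the database name directly, so "db ateacrawldata" (with its space) was treated as the name. Below is a rough sketch of the naming rules as I understand them from the MongoDB documentation for Unix/Linux deployments: non-empty, fewer than 64 characters, and none of / \ . space " $ or the null character. The function name isPlausibleDbName is hypothetical, and the check is an approximation, not the server's own validation.

```javascript
// Hypothetical helper: approximate check of a MongoDB database name
// against the documented restrictions for Unix/Linux deployments.
function isPlausibleDbName(name) {
  if (typeof name !== "string" || name.length === 0 || name.length >= 64) {
    return false;
  }
  // Reject names containing / \ . space " $ or the null character.
  return !/[\/\\. "$\0]/.test(name);
}

console.log(isPlausibleDbName("ateacrawldata"));    // name used with "use ateacrawldata"
console.log(isPlausibleDbName("db ateacrawldata")); // name the shell saw in the failed attempt
```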
301
302------------------------
303
304Ask Clint to rename "anupama" database to "ateacrawldata" database following the instructions at:
305 https://stackoverflow.com/questions/9201832/how-do-you-rename-a-mongodb-database
306I don't have permissions to do this.
307Nor do I have permissions to create Mongo collections within a new database that I create, like ateacrawldata.
308I only seem to have rights to the "anupama" database.
309
310
311
312-----------------------
313Vagrant virtual machine Node1 has the mongodb installed.
314
315After doing "vagrant up" on node1 to start node1:
316
317 [anupama@analytics vagrant-hadoop-hive-spark]$ vagrant ssh
318 vagrant@node1:~$ mongo
319 MongoDB shell version v4.2.1
320 connecting to: mongodb://127.0.0.1:27017/?compressors=disabled&gssapiServiceName=mongodb
321 2019-11-13T09:22:46.996+0000 E QUERY [js] Error: couldn't connect to server 127.0.0.1:27017, connection attempt failed: SocketException: Error connecting to 127.0.0.1:27017 :: caused by :: Connection refused :
322 connect@src/mongo/shell/mongo.js:341:17
323 @(connect):2:6
324 2019-11-13T09:22:46.999+0000 F - [main] exception: connect failed
325 2019-11-13T09:22:46.999+0000 E - [main] exiting with code 1
326 vagrant@node1:~$ sudo service mongod status
327 ● mongod.service - MongoDB Database Server
328 Loaded: loaded (/lib/systemd/system/mongod.service; disabled; vendor preset: enabled)
329 Active: inactive (dead)
330 Docs: https://docs.mongodb.org/manual
331 vagrant@node1:~$ sudo service mongod start
332 vagrant@node1:~$ sudo service mongod status
333 ● mongod.service - MongoDB Database Server
334 Loaded: loaded (/lib/systemd/system/mongod.service; disabled; vendor preset: enabled)
335 Active: active (running) since Wed 2019-11-13 09:24:07 UTC; 2s ago
336 Docs: https://docs.mongodb.org/manual
337 Main PID: 4383 (mongod)
338 Tasks: 32
339 Memory: 199.3M
340 CPU: 754ms
341 CGroup: /system.slice/mongod.service
342 └─4383 /usr/bin/mongod --config /etc/mongod.conf
343
344 Nov 13 09:24:07 node1 systemd[1]: Started MongoDB Database Server.
345 vagrant@node1:~$
346
347
348So now mongodb is running on node1 on localhost:27017.
349
350Next, in another x-term connected to analytics' node1 Vagrant VM, port forward node1's localhost:27017 to analytics' localhost:27017:
351 vagrant ssh -- -L 27017:localhost:27017
352
353
354
355Finally, in another x-term, port-forward from analytics:27017 to current machine's 27017:
356 ssh -L 27017:localhost:27017 analytics
357
358
359Now can connect Robo-3T running on current machine to localhost:27017.
360
361Then in a new x-term, can use the client mongo shell to connect (by default to localhost:27017):
362
363 wharariki:[122]/Scratch/ak19/GS309>mongo --shell
364 MongoDB shell version v4.0.13
365 connecting to: mongodb://127.0.0.1:27017/?gssapiServiceName=mongodb
366 ...
367 > show dbs
368 admin 0.000GB
369 ateacrawldata 1.532GB
370 config 0.000GB
371 local 0.000GB
372 > use ateacrawldata
373
374 > show collections
375 Webpages
376 Websites
377 oldwebpages
378 oldwebsites
379-------------------
380
381Country code to geolocation CSV file found by Dr Bainbridge:
382https://developers.google.com/public-data/docs/canonical/countries_csv
383
384Import into mongodb with:
385https://stackoverflow.com/questions/4686500/how-to-use-mongoimport-to-import-csv
386
387
388
389NOTE: mongoimport is a commandline utility and not a command to be run from the mongo shell. See https://jira.mongodb.org/browse/DOCS-11072
390This means, in an x-term, DON'T RUN MONGO SHELL/client first. Instead, directly from x-term, run the following to import the countrycodes.csv file:
391
392
393 mongoimport -d ateacrawldata -c countrylocations --type csv --file /Scratch/ak19/maori-lang-detection/MoreReading/countrycodes.csv --headerline
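
What --headerline does can be sketched in plain JavaScript: the first CSV line supplies the field names and each later line becomes one document. This is an illustration only, not mongoimport's implementation. It assumes a simple CSV with no quoted fields or embedded commas, and it keeps every value as a string. The column names (country, latitude, longitude, name) and the rounded sample rows are assumptions for illustration and should be checked against the actual countrycodes.csv.

```javascript
// Hypothetical helper: turns CSV text with a header line into an array of
// documents, using the header fields as keys (all values kept as strings).
function csvToDocs(csvText) {
  const lines = csvText.split(/\r?\n/).filter((line) => line.trim() !== "");
  const fields = lines[0].split(",").map((f) => f.trim());
  return lines.slice(1).map((line) => {
    const values = line.split(",");
    const doc = {};
    fields.forEach((field, i) => {
      doc[field] = (values[i] || "").trim();
    });
    return doc;
  });
}

// Illustrative sample only (approximate coordinates, assumed column names).
const sampleCsv =
  "country,latitude,longitude,name\n" +
  "NZ,-40.9,174.9,New Zealand\n" +
  "AU,-25.3,133.8,Australia\n";
const docs = csvToDocs(sampleCsv);
console.log(docs);
```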
394
395
396-------------------------
397
398MONGODB QUERIES:
399
db.getCollection('webpages').find({"isMRI": true, "singleSentences.langCode": "mri"})
db.getCollection('webpages').find({"singleSentences": { $elemMatch: {"langCode":"mri"} } }, {"singleSentences.$": 1})
db.getCollection('Webpages').find({"isMRI": true, "singleSentences": { $elemMatch: {"langCode":"eng"} } }, {"singleSentences.$": 1}) [gets the 1st English sentence of each page that is overall in MRI]
db.getCollection('Webpages').find({"containsMRI": true, "singleSentences": { $elemMatch: {"langCode":"mri"} } }, {"singleSentences.$": 1}) [gets the 1st MRI sentence of each page that has sentences in MRI]
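
The effect of $elemMatch combined with the positional projection ("singleSentences.$") is that each matching document comes back with only the first array element that satisfied the condition. Below is a rough plain-JavaScript emulation on an in-memory array. The sample pages and the "sentence" field are invented for illustration (only singleSentences and langCode appear in the queries above), and the function name is hypothetical.

```javascript
// Hypothetical emulation of find({singleSentences: {$elemMatch: {langCode}}},
// {"singleSentences.$": 1}) on an in-memory array of page documents.
function findFirstMatchingSentence(pages, langCode) {
  const results = [];
  for (const page of pages) {
    const sentences = page.singleSentences || [];
    const first = sentences.find((s) => s.langCode === langCode);
    if (first !== undefined) {
      // Keep _id plus only the first matching array element.
      results.push({ _id: page._id, singleSentences: [first] });
    }
  }
  return results;
}

// Invented sample data.
const samplePages = [
  {
    _id: 1,
    singleSentences: [
      { langCode: "eng", sentence: "Hello." },
      { langCode: "mri", sentence: "Kia ora." },
      { langCode: "mri", sentence: "Tēnā koe." },
    ],
  },
  { _id: 2, singleSentences: [{ langCode: "eng", sentence: "Welcome." }] },
];
const mriResults = findFirstMatchingSentence(samplePages, "mri");
console.log(JSON.stringify(mriResults));
```

Page 2 is absent from the result because none of its sentences match, and page 1 contributes only its first MRI sentence even though it has two.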
404
405
406READING
407
408mongodb java convert class
409https://www.quora.com/What-are-the-ways-of-converting-a-Java-object-to-a-MongoDB-document-and-vice-versa
410https://stackoverflow.com/questions/39320825/pojo-to-org-bson-document-and-vice-versa
411X https://mongodb.github.io/morphia/
412https://stackoverflow.com/questions/10170506/inserting-java-object-to-mongodb-collection-using-java
413X https://www.google.com/search?q=morphia+example&oq=morphia+example&aqs=chrome.0.0l6.4223j0j9&sourceid=chrome&ie=UTF-8
414https://www.baeldung.com/mongodb-morphia
415X https://web.archive.org/web/20171117121335/http://mongodb.github.io/morphia/1.3/getting-started/
416=> https://morphia.dev/1.4/getting-started/quick-tour/
417https://github.com/MorphiaOrg/morphia/tree/master/docs/reference
418
419
420mongodb querying
421https://docs.mongodb.com/manual/tutorial/query-embedded-documents/
422https://docs.mongodb.com/manual/tutorial/query-arrays/
423https://www.google.com/search?q=mongodb+find+subdocument&oq=mongodb+find+&aqs=chrome.0.69i59j69i57j0l4.7607j1j8&sourceid=chrome&ie=UTF-8
424https://stackoverflow.com/questions/25586901/how-to-find-document-and-single-subdocument-matching-given-criterias-in-mongodb
425https://stackoverflow.com/questions/21113543/mongodb-get-subdocument
426https://stackoverflow.com/questions/36948856/find-subdocuments-in-mongo
427https://docs.mongodb.com/v3.0/reference/operator/projection/positional/#proj._S_
428https://www.google.com/search?q=mongodb+query+tutorial&oq=mongodb+query+tutorial&aqs=chrome..69i57j0l2j69i60l3.4719j0j7&sourceid=chrome&ie=UTF-8
429https://blog.exploratory.io/an-introduction-to-mongodb-query-for-beginners-bd463319aa4c
430https://docs.mongodb.com/manual/reference/method/db.collection.find/
431https://docs.mongodb.com/manual/reference/method/db.collection.find/#find-projection
432https://stackoverflow.com/questions/39641925/mongodb-aggregation-framework-to-get-frequencies-of-fields-values
433
434https://exploratory.io/note/kanaugust/0961813761939766
435https://docs.mongodb.com/manual/tutorial/project-fields-from-query-results/
436https://docs.mongodb.com/manual/aggregation/
437
438
439Mongo Studio 3T documentation:
440https://studio3t.com/download/ (also has uninstall information)
441https://studio3t.com/download-thank-you/?OS=x64
442
443Google: MongoDB visualization
444MongoDB visualization map
445MongoDB Charts
446 (Open source visualisation tools)
447
448json map visualizer
449 geojson.tools
450-------------------
451
452Some queries with results:
453
454# Num websites
455db.getCollection('Websites').find({}).count()
4561445
457
458# Num webpages
459db.getCollection('Webpages').find({}).count()
460X75139
461117496
462
463# Find number of websites that have 1 or more pages detected as being in Maori (a positive numPagesInMRI)
464db.getCollection('Websites').find({numPagesInMRI: { $gt: 0}}).count()
465361
466
467# Number of sites containing at least one sentence for which OpenNLP detected the best language = MRI
468db.getCollection('Websites').find({numPagesContainingMRI: {$gt: 0}}).count()
469868
470
# The union of the above two has the same count as numPagesContainingMRI alone (868), which suggests every site with a page in MRI also has a page containing MRI:
472db.getCollection('Websites').find({ $or: [ { numPagesInMRI: { $gt: 0 } }, { numPagesContainingMRI: {$gt: 0} } ] } ).count()
473868
474
475# Find number of webpages that are deemed to be overall in MRI (pages where isMRI=true)
476db.getCollection('Webpages').find({isMRI:true}).count()
477X5224
478X5215
479db.getCollection('Webpages').find({isMRI:true}).count()
4807818
481
482# Number of pages that contain any number of MRI sentences
483db.getCollection('Webpages').find({containsMRI: true}).count()
484X12858
48520371
486
487
488# Number of sites with URLs containing /mi(/)
489db.getCollection('Websites').find({urlContainsLangCodeInPath:true}).count()
490X 153
491# Number of sites with URLs containing /mi(/) OR http(s)://mi.*
492db.getCollection('Websites').find({urlContainsLangCodeInPath:true}).count()
493670
494
495# Number of websites that are outside NZ that contain /mi(/) in any of its sub-urls
496db.getCollection('Websites').find({urlContainsLangCodeInPath:true, geoLocationCountryCode: {$ne : "NZ"} }).count()
497X 147
498# Number of websites that are outside NZ that contain /mi(/) OR http(s)://mi.* in any of its sub-urls
499db.getCollection('Websites').find({urlContainsLangCodeInPath:true, geoLocationCountryCode: {$ne : "NZ"} }).count()
500656
501
502# 6 sites with URLs containing /mi(/) that are in NZ
503db.getCollection('Websites').find({urlContainsLangCodeInPath:true, geoLocationCountryCode: "NZ"}).count()
504X 6
505# 14 sites with URLs containing /mi(/) OR http(s)://mi.* that are in NZ
50614
507
508
509# sort websites that contain /mi(/) in path by geoLocationCountryCode
510# https://www.quackit.com/mongodb/tutorial/mongodb_sort_query_results.cfm
511db.getCollection('Websites').find({urlContainsLangCodeInPath:true}).sort({geoLocationCountryCode: 1})
512
513Actually, I want to sort by count. See https://docs.mongodb.com/manual/reference/operator/aggregation/sortByCount/
514
515
516# PROJECTION:
db.getCollection('Websites').find({geoLocationCountryCode: {$ne:"NZ"}}, {geoLocationCountryCode:1, urlContainsLangCodeInPath: 1})
518
519https://docs.mongodb.com/manual/aggregation/
520EXAMPLE:
521db.orders.aggregate([
522 { $match: { status: "A" } },
523 { $group: { _id: "$cust_id", total: { $sum: "$amount" } } }
524])
525
526X db.Websites.aggregate([{ $match:{urlContainsLangCodeInPath:true}}, $group: {geoLocationCountryCode:1, total: $count}])
527
528
529X db.Websites.aggregate([
530 { $match:{urlContainsLangCodeInPath:true}},
531 {$group: {geoLocationCountryCode:1}}
532])
533
WORKS (but adding an $unwind stage before the $group gets rid of the "null" group):
535db.Websites.aggregate([
536 { $match:{urlContainsLangCodeInPath:true}},
537 {$group: {_id: "$geoLocationCountryCode", count: {$sum: 1}}},
538 { $sort : { count : -1} }
539])
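
What the pipeline above computes can be emulated in plain JavaScript on an in-memory array, which also shows where the "null" group comes from. Documents without a geoLocationCountryCode are grouped under null by $group, whereas a preceding $unwind on that field drops them (by default $unwind omits documents whose field is missing or null). The sample sites and the function name are invented for illustration.

```javascript
// Hypothetical emulation of {$group: {_id: "$geoLocationCountryCode",
// count: {$sum: 1}}} followed by {$sort: {count: -1}}.
// dropMissing = true mimics putting an $unwind on the field first.
function countByCountry(sites, dropMissing) {
  const counts = new Map();
  for (const site of sites) {
    const code = site.geoLocationCountryCode;
    const missing = code === undefined || code === null;
    if (missing && dropMissing) continue; // what a preceding $unwind would do
    const key = missing ? null : code;
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return Array.from(counts, ([_id, count]) => ({ _id, count }))
    .sort((a, b) => b.count - a.count);
}

// Invented sample data.
const sampleSites = [
  { domain: "a.example", geoLocationCountryCode: "NZ" },
  { domain: "b.example", geoLocationCountryCode: "NZ" },
  { domain: "c.example", geoLocationCountryCode: "US" },
  { domain: "d.example" }, // no country code recorded
];
const grouped = countByCountry(sampleSites, false);
const groupedAfterUnwind = countByCountry(sampleSites, true);
console.log(grouped);
console.log(groupedAfterUnwind);
```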
540
541
542# COUNT OF ALL GEOLOCATION COUNTRIES
543#https://stackoverflow.com/questions/14924495/mongodb-count-num-of-distinct-values-per-field-key
544 # LIST
545 db.Websites.distinct('geoLocationCountryCode');
546
547 # COUNT
548 db.Websites.distinct('geoLocationCountryCode').length;
549
 # A COUNT WITH QUERY - https://docs.mongodb.com/manual/reference/command/distinct/#dbcmd.distinct
 # (the command returns a document with a "values" array, so take its length to get the count)

 db.runCommand ( { distinct: "Websites", key: "geoLocationCountryCode", query: { "urlContainsLangCodeInPath": true} } ).values.length;
553
554 # DISTINCT WITH QUERY WITHOUT COUNT - https://docs.mongodb.com/manual/reference/method/db.collection.distinct/
555 db.Websites.distinct('geoLocationCountryCode', {"urlContainsLangCodeInPath": true});
556
557 #SORTED - https://stackoverflow.com/questions/4759437/get-distinct-values-with-sorted-data
558 db.Websites.distinct('geoLocationCountryCode', {"urlContainsLangCodeInPath": true}).sort();
559
560
561 # count of all sites for which the geolocation is UNKNOWN
562 db.getCollection('Websites').find({geoLocationCountryCode: {$eq:"UNKNOWN"}}).count()
563
564
565# AGGREGATION QUERIES THAT WORK:
566#https://stackoverflow.com/questions/14924495/mongodb-count-num-of-distinct-values-per-field-key
567
568WORKS:
569// count of country codes for all sites
570db.Websites.aggregate([
571
572 { $unwind: "$geoLocationCountryCode" },
573 {
574 $group: {
575 _id: "$geoLocationCountryCode",
576 count: { $sum: 1 }
577 }
578 },
579 { $sort : { count : -1} }
580]);
581
582// count of country codes for sites that have at least one page detected as MRI
583
584db.Websites.aggregate([
585 {
586 $match: {
587 numPagesInMRI: {$gt: 0}
588 }
589 },
590 { $unwind: "$geoLocationCountryCode" },
591 {
592 $group: {
593 _id: {$toLower: '$geoLocationCountryCode'},
594 count: { $sum: 1 }
595 }
596 },
597 { $sort : { count : -1} }
598]);
599
600// count of country codes for sites that have at least one page containing at least one sentence detected as MRI
601db.Websites.aggregate([
602 {
603 $match: {
604 numPagesContainingMRI: {$gt: 0}
605 }
606 },
607 { $unwind: "$geoLocationCountryCode" },
608 {
609 $group: {
610 _id: {$toLower: '$geoLocationCountryCode'},
611 count: { $sum: 1 }
612 }
613 },
614 { $sort : { count : -1} }
615]);
616
617
618WORKS:
619// count of country codes for sites that have /mi(/) or http(s)://mi.* in URL path
620
621db.Websites.aggregate([
622 {
623 $match: {
624 urlContainsLangCodeInPath: true
625 }
626 },
627 { $unwind: "$geoLocationCountryCode" },
628 {
629 $group: {
630 _id: {$toLower: '$geoLocationCountryCode'},
631 count: { $sum: 1 }
632 }
633 },
634 { $sort : { count : -1} }
635]);
636
637
638WORKS:
639db.Websites.aggregate([
640 {
641 $match: {
642 geoLocationCountryCode: {$ne : "UNKNOWN"}
643 }
644 },
645 { $unwind: "$geoLocationCountryCode" },
646 {
647 $group: {
648 _id: "$geoLocationCountryCode",
649 count: { $sum: 1 }
650 }
651 },
652 { $sort : { count : -1} }
653]);
654
655WORKS:
656db.Websites.aggregate([
657 {
658 $match: {
659 "urlContainsLangCodeInPath": true
660 }
661 },
662 { $unwind: "$geoLocationCountryCode" },
663 {
664 $group: {
665 _id: "$geoLocationCountryCode",
666 count: { $sum: 1 }
667 }
668 },
669 { $sort : { count : -1} }
670]);
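
The pipelines above differ only in their $match stage and in whether the country code is lower-cased, so they could be generated by one helper. A sketch follows. The function name countryCountPipeline and its parameters are hypothetical. It returns a plain array that can be passed to db.Websites.aggregate(...) in the mongo shell.

```javascript
// Hypothetical helper: builds the "count sites per country code" pipeline.
// match: a $match filter document, or null for no $match stage.
// lowerCase: if true, group on the lower-cased country code.
function countryCountPipeline(match, lowerCase) {
  const groupId = lowerCase
    ? { $toLower: "$geoLocationCountryCode" }
    : "$geoLocationCountryCode";
  const pipeline = [];
  if (match) pipeline.push({ $match: match });
  pipeline.push(
    { $unwind: "$geoLocationCountryCode" },
    { $group: { _id: groupId, count: { $sum: 1 } } },
    { $sort: { count: -1 } }
  );
  return pipeline;
}

const byLangCodeInPath = countryCountPipeline({ urlContainsLangCodeInPath: true }, true);
console.log(JSON.stringify(byLangCodeInPath, null, 1));
// In the mongo shell: db.Websites.aggregate(byLangCodeInPath);
```

Passing null as the match gives the count over all sites, as in the first pipeline of this section.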
671
672
673KEEP ADDITIONAL FIELDS - https://stackoverflow.com/questions/16662405/mongo-group-query-how-to-keep-fields:
674 a. KEEPS ONLY FIRST DOMAIN URL FOR EACH COUNTED COUNTRY CODE:
675
676 db.Websites.aggregate([
677 {
678 $match: {
679 "urlContainsLangCodeInPath": true
680 }
681 },
682 { $unwind: "$geoLocationCountryCode" },
683 {
684 $group: {
685 _id: "$geoLocationCountryCode", count: { $sum: 1 },
686 domain: {$first: '$domain'}
687 }
688 },
689 { $sort : { count : -1} }
690 ]);
691
692 b. KEEP ALL DOMAIN URLS:
693 db.Websites.aggregate([
694 {
695 $match: {
696 "urlContainsLangCodeInPath": true
697 }
698 },
699 { $unwind: "$geoLocationCountryCode" },
700 {
701 $group: {
702 _id: "$geoLocationCountryCode",
703 count: { $sum: 1 },
704 domain: { $addToSet: '$domain' }
705 }
706 },
707 { $sort : { count : -1} }
708 ]);
709
710
711# WANT TO GET THE ABOVE INTO WORLD MAP, use geojson.tools found by Dr Bainbridge
712geojson.tools
713USAGE: https://www.here.xyz/viewer-tool/
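
One possible route onto a world map is to join the per-country counts with the imported country locations and emit GeoJSON, the kind of input a GeoJSON viewer expects. Below is a sketch in plain JavaScript under stated assumptions: that the countrylocations documents have country, latitude and longitude fields (to be checked against the CSV header), and that the counts come from one of the aggregations above with upper-case country codes as _id. The function name and the sample values are hypothetical. Note that GeoJSON positions are ordered [longitude, latitude].

```javascript
// Hypothetical helper: joins aggregation output ({_id: countryCode, count})
// with country location documents and builds a GeoJSON FeatureCollection.
function countsToGeoJson(counts, locations) {
  const byCode = new Map(locations.map((loc) => [loc.country, loc]));
  const features = [];
  for (const { _id, count } of counts) {
    const loc = byCode.get(_id);
    if (!loc) continue; // e.g. "UNKNOWN" has no location to plot
    features.push({
      type: "Feature",
      geometry: {
        type: "Point",
        // GeoJSON order is longitude first, then latitude.
        coordinates: [Number(loc.longitude), Number(loc.latitude)],
      },
      properties: { countryCode: _id, count: count },
    });
  }
  return { type: "FeatureCollection", features: features };
}

// Invented sample input (approximate coordinates, illustrative counts).
const sampleCounts = [{ _id: "NZ", count: 14 }, { _id: "UNKNOWN", count: 3 }];
const sampleLocations = [{ country: "NZ", latitude: -40.9, longitude: 174.9 }];
const geojson = countsToGeoJson(sampleCounts, sampleLocations);
console.log(JSON.stringify(geojson, null, 1));
```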
714
715
716AIMS:
717* Identify where Maori language is online.
718* How can we identify high quality sites that would be good for a corpus.
719(Related work for other languages to quantifiably answer that)
720
721data-preparation
722docs
723
724
725------------------------------------------
726
727BUILDING TOWARDS NEW MONGODB QUERY: Counts by country code of TENTATIVE NON-PRODUCT SITES that are in Maori
728---
729
730# https://stackoverflow.com/questions/16902930/mongodb-aggregation-framework-match-or
731# https://docs.mongodb.com/manual/reference/operator/query/and/
732
733# 1. all the websites which are from NZ:
734db.getCollection('Websites').find({geoLocationCountryCode: "NZ"}).count()
735128
736
737# 2. all the websites that have /mi in URL path which are from NZ:
db.getCollection('Websites').find({$and: [{urlContainsLangCodeInPath: true}, {geoLocationCountryCode: "NZ"}]}).count()
7396
740
741# 3. all the websites that don't have /mi in URLpath
742db.getCollection('Websites').find({urlContainsLangCodeInPath: false}).count()
7431292
744
745# 4. all the websites that don't have /mi, or if they do are from NZ
# (should be the sum of points 2 and 3 above)
747db.getCollection('Websites').find({$or: [{urlContainsLangCodeInPath: false}, {$and: [{urlContainsLangCodeInPath: true}, {geoLocationCountryCode: "NZ"}]}]}).count()
7481298
749
# 5. All the websites that have at least 1 page containing MRI AND either don't have /mi in the URL path or, if they do, are from NZ
751# These are the TENTATIVE NON-PRODUCT SITES
752# Should be less than the point 4, but more than 1 to 3
753
754db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{$or: [{urlContainsLangCodeInPath: false}, {$and: [{urlContainsLangCodeInPath: true}, {geoLocationCountryCode: "NZ"}]}]}]}).count()
755X 859
756
757Now with http(s)://mi.* also excluded, the above query returns a count of:
758389
759
760
761BUT THIS IS THE CORRECT VERSION OF THE QUERY:
762db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{$or: [{geoLocationCountryCode: "NZ"}, {urlContainsLangCodeInPath: false}]}]}).count()
763389
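
Both forms of the query in point 5 return the same count (389). A likely reason: for a field that is always true or false, "A is false, or (A is true and B)" reduces to "A is false, or B". Below is a small truth-table check in plain JavaScript, with inPath standing for urlContainsLangCodeInPath and isNZ for geoLocationCountryCode being "NZ". The function names are hypothetical. The two forms would differ for a document where urlContainsLangCodeInPath is missing, since such a document matches neither the true nor the false condition.

```javascript
// Long form: {$or: [{inPath: false}, {$and: [{inPath: true}, {isNZ}]}]}
function longForm(inPath, isNZ) {
  return inPath === false || (inPath === true && isNZ);
}

// Short form: {$or: [{isNZ}, {inPath: false}]}
function shortForm(inPath, isNZ) {
  return isNZ || inPath === false;
}

// Compare the two forms over every combination of boolean inputs.
let allEqual = true;
for (const inPath of [true, false]) {
  for (const isNZ of [true, false]) {
    if (longForm(inPath, isNZ) !== shortForm(inPath, isNZ)) allEqual = false;
  }
}
console.log("equivalent for boolean inputs:", allEqual);

// A missing field (undefined here) is where the two forms part ways.
console.log(longForm(undefined, true), shortForm(undefined, true));
```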
764
765
766# 6. Now do the counts by country code of the above, by pasting the query of point 5 as the $match clause (i.e. without the .count() suffix)
767# Counts by country code of TENTATIVE NON-PRODUCT SITES that are in Maori
768db.Websites.aggregate([
769 {
770 $match: {$and: [{numPagesContainingMRI: {$gt: 0}},{$or: [{urlContainsLangCodeInPath: false}, {$and: [{urlContainsLangCodeInPath: true}, {geoLocationCountryCode: "NZ"}]}]}]}
771 },
772 { $unwind: "$geoLocationCountryCode" },
773 {
774 $group: {
775 _id: {$toLower: '$geoLocationCountryCode'},
776 count: { $sum: 1 }
777 }
778 },
779 { $sort : { count : -1} }
780]);
781
782The result is very close to the same aggregate on just numPagesContainingMRI.
783
784That's because if you count those websites that contain /mi/ AND numPagesContainingMRI, they're very few:
785
786db.Websites.aggregate([
787 {
788 $match: {
789 $and: [{numPagesContainingMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}]
790 }
791 },
792 { $unwind: "$geoLocationCountryCode" },
793 {
794 $group: {
795 _id: {$toLower: '$geoLocationCountryCode'},
796 count: { $sum: 1 }
797 }
798 },
799 { $sort : { count : -1} }
800]);
801
802
803_id count
804us 4.0
805nz 4.0
806au 3.0
807ru 1.0
808de 1.0
809
Total: 13 sites that have /mi/ and are detected as having MRI content:
db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}]}).count()
13

Of these 13, the 4 from NZ were already included in steps 5 and 6. So the difference is only 9 sites (13 - 4) that have /mi/ in the URL path.
815
816
Let's get a listing of the sites' domains. Note that 3 sites whose country codes are NOT NZ nevertheless have a .nz TLD!
818/* 1 */
819{
820 "_id" : "nz",
821 "count" : 4.0,
822 "domain" : [
823 "http://firstworldwar.tki.org.nz",
824 "http://www.firstworldwar.tki.org.nz",
825 "https://admin.teara.govt.nz",
826 "http://community.nzdl.org"
827 ]
828}
829
830/* 2 */
831{
832 "_id" : "us",
833 "count" : 4.0,
834 "domain" : [
835 "https://sexualviolence.victimsinfo.govt.nz",
836 "https://follow3rs.com",
837 "http://www.church-of-christ.org",
838 "http://www.mytrickstips.com"
839 ]
840}
841
842/* 3 */
843{
844 "_id" : "au",
845 "count" : 3.0,
846 "domain" : [
847 "https://rapuatearatika.education.govt.nz",
848 "https://www.kiwiproperty.com",
849 "https://curriculumtool.education.govt.nz"
850 ]
851}
852
853/* 4 */
854{
855 "_id" : "ru",
856 "count" : 1.0,
857 "domain" : [
858 "http://www.treningmozga.com"
859 ]
860}
861
862/* 5 */
863{
864 "_id" : "de",
865 "count" : 1.0,
866 "domain" : [
867 "http://www.almancax.com" # Website to learn German, autotranslated
868 ]
869}
870
871
872But we're not catching a potentially large number of auto-translated sites, like
873- https://www.gigalight.com/all-languages.html
874- http://www.hzhinew.com/
875
876https://culturesconnection.com/manual-or-automatic-translation/
877Manual Or Automatic Translation?
878
879Automatic translation continues to improve day by day. However, it is still unable to reach perfect levels of accuracy and lacks a natural feel. Will it ever replace human translation?
880
881--------------
Mr Bill Rogers' suggestions for a first attempt at sieving out the auto-translated sites:
- skip .com and .co.<tld> domains. But .co.nz is also used for non-commercial sites and for sites that nevertheless have high quality Maori language content.
- change the cut-off value of the OpenNLP language prediction? But for sentences and overlapping sentences we're not using the cut-off value; we just check the best predicted language regardless of confidence level.

- Code for (a range of) loading of language options in auto-translated sites?
887
888====================
889
890# https://stackoverflow.com/questions/20175122/how-can-i-use-not-like-operator-in-mongodb
891
892Info on the sites with Maori language content that are either from NZ or have .nz domain (TLD):
893
894 db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}}, {$or:[{geoLocationCountryCode: "NZ"}, {domain: /.nz$/}]}]})
895
896 db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}}, {$or:[{geoLocationCountryCode: "NZ"}, {domain: /.nz$/}]}]}).count()
897 183
898
Inverse: the sites detected as containing at least 1 Maori language sentence that are NEITHER from NZ NOR have a .nz domain ending (TLD):
900 db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}}, {geoLocationCountryCode: {$ne: "NZ"}}, {domain: {$not: /.nz$/}}]}).count()
901 685
902
903The above two figures correctly add up to a total of 868 sites, which is the number of sites detected as containing at least 1 sentence in MRI:
904 db.getCollection('Websites').find({numPagesContainingMRI: {$gt: 0}}).count()
905 868
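
Why the two counts must add up: the second filter is the exact negation of the first (De Morgan), so every site falls into exactly one of the two sets. A plain JavaScript sketch of this, with made-up sample sites and made-up helper names isNzRelated and isNotNzRelated:

```javascript
// Sketch only: the sample domains and country codes are made up.
const sites = [
  { domain: "http://one.example.nz", geoLocationCountryCode: "NZ" },
  { domain: "http://two.example.nz", geoLocationCountryCode: "US" },
  { domain: "http://three.example.com", geoLocationCountryCode: "NZ" },
  { domain: "http://four.example.com", geoLocationCountryCode: "AU" },
  { domain: "http://five.example.org", geoLocationCountryCode: "DE" }
];

const nzPattern = /.nz$/; // same pattern as in the queries above

// Models {$or: [{geoLocationCountryCode: "NZ"}, {domain: /.nz$/}]}
const isNzRelated = s =>
  s.geoLocationCountryCode === "NZ" || nzPattern.test(s.domain);

// Models {geoLocationCountryCode: {$ne: "NZ"}} AND {domain: {$not: /.nz$/}}
const isNotNzRelated = s =>
  s.geoLocationCountryCode !== "NZ" && !nzPattern.test(s.domain);

const nzCount = sites.filter(isNzRelated).length;
const otherCount = sites.filter(isNotNzRelated).length;
console.log(nzCount, otherCount, nzCount + otherCount === sites.length);
```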
906
907Without those with /mi in path:
908 db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}}, {geoLocationCountryCode: {$ne: "NZ"}}, {domain: {$not: /.nz$/}}, {urlContainsLangCodeInPath: false}]}).count()
909
910Now let's get a listing of all 685 sites to be manually inspected to determine whether they're auto-translated:
911
912/*
913db.Websites.aggregate([
914 {
915 $match: {
916 $and: [{numPagesContainingMRI: {$gt: 0}}, {geoLocationCountryCode: {$ne: "NZ"}}, {domain: {$not: /.nz$/}}, {urlContainsLangCodeInPath: false}]
917 }
918 },
919 { $unwind: "$geoLocationCountryCode" },
920 {
921 $group: {
922 _id: {$toLower: '$geoLocationCountryCode'},
923 count: { $sum: 1 },
924 domain: { $addToSet: '$domain' }
925 }
926 },
927 { $sort : { count : -1} }
928]);
929*/
930db.Websites.aggregate([
931 {
932 $match: {
933 $and: [{numPagesContainingMRI: {$gt: 0}}, {geoLocationCountryCode: {$ne: "NZ"}}, {domain: {$not: /.nz$/}}, {urlContainsLangCodeInPath: {$ne: true}}]
934 }
935 },
936 { $unwind: "$geoLocationCountryCode" },
937 {
938 $group: {
939 _id: {$toLower: '$geoLocationCountryCode'},
940 count: { $sum: 1 },
941 domain: { $addToSet: '$domain' }
942 }
943 },
944 { $sort : { count : -1} }
945]);
946
947
We can knock off another 54 non-NZ sites with our new urlContainsLangCodeInPathPrefix field:
949
950 db.getCollection('Websites').find({urlContainsLangCodeInPathPrefix: true, geoLocationCountryCode: {$ne: "NZ"}, domain: {$not: /.nz$/}}).count()
951 54
952
953
954SO, can repeat query with new field "urlContainsLangCodeInPathPrefix":
955Number of sites containing >= 1 MRI sentences that are not from NZ or of .nz TLD and which don't contain "/mi(/)" or "http(s)://mi." in URL path:
956 db.getCollection('Websites').find({$and: [
957 {numPagesContainingMRI: {$gt: 0}},
958 {geoLocationCountryCode: {$ne: "NZ"}},
959 {domain: {$not: /.nz$/}},
960 {urlContainsLangCodeInPathSuffix: {$ne: true}},
961 {urlContainsLangCodeInPathPrefix: {$ne: true}}
962 ]}).count()
963
964 651
965
966
967REDO THE COUNT BY COUNTRY QUERY FOR THIS:
968
969db.Websites.aggregate([
970 {
971 $match: {
972 $and: [{numPagesContainingMRI: {$gt: 0}}, {geoLocationCountryCode: {$ne: "NZ"}}, {domain: {$not: /.nz$/}}, {urlContainsLangCodeInPathSuffix: {$ne: true}}, {urlContainsLangCodeInPathPrefix: {$ne: true}}]
973 }
974 },
975 { $unwind: "$geoLocationCountryCode" },
976 {
977 $group: {
978 _id: {$toLower: '$geoLocationCountryCode'},
979 count: { $sum: 1 },
980 domain: { $addToSet: '$domain' }
981 }
982 },
983 { $sort : { count : -1} }
984]);
985
986
AFTER BUGFIX SO THAT miInURLPath IS NOW SET CORRECTLY:
988db.getCollection('Websites').find(
989{$and: [
990 {numPagesContainingMRI: {$gt: 0}},
991 {geoLocationCountryCode: {$ne: "NZ"}},
992 {domain: {$not: /.nz$/}},
993 {urlContainsLangCodeInPath: {$ne: true}}
994]}).count()
995
220
997
998db.Websites.aggregate([
999 {
1000 $match: {
1001 $and: [
1002 {numPagesContainingMRI: {$gt: 0}},
1003 {geoLocationCountryCode: {$ne: "NZ"}},
1004 {domain: {$not: /.nz$/}},
1005 {urlContainsLangCodeInPath: {$ne: true}}
1006 ]
1007 }
1008 },
1009 { $unwind: "$geoLocationCountryCode" },
1010 {
1011 $group: {
1012 _id: {$toLower: '$geoLocationCountryCode'},
1013 count: { $sum: 1 },
1014 domain: { $addToSet: '$domain' }
1015 }
1016 },
1017 { $sort : { count : -1} }
1018]);
1019
A website's pages can be inspected to check whether they are relevant or auto-translated as follows:
1021 db.getCollection('Webpages').find({URL:/svenkirsten.com/, mriSentenceCount: {$gt: 0}})
1022
1023
1024* CN: Only 1/113 sites from CN stood out as being of interest: http://kiwi2china.com/
1025 BUT: it's auto-translated (e.g. Dutch is clearly auto-translated), MRI not in default or any visible drop down list, and the domain changes once you view it in Dutch to https://nl.admission.nz/
1026
1027* FR: 16 sites from FR
1028 http://blueheavenisland.com, http://www.blueheavenisland.com - misdetection. French Polynesia
1029 https://www.lexilogos.com/ -> takes me to NZ website MaoriDictionary.co.nz etc for translating words anyway
1030 http://kihikihi.fr/ -> travel (blog?). Appears to be Hawaiian related and not Maori.
1031!! http://chantsdeluttes.free.fr/versionsinter/page%20maori.html -> Seems it may be a proper translation or composition, as Dutch and Flemish (and Groningense) versions are different songs by individual translators/composers
1032 http://splaf.free.fr/pfurb.html - Tahiti, French Polynesian, ... island names
1033X http://mi.fitnessrebates.com - Uses https://wordpress.org/plugins/weglot/ wordpress-compatible multilingual plugin, which ensures translated pages get indexed by google - exactly what we want to avoid
1034 http://mahajana.net - misdetected a Japanese Zen Buddhist chant as MRI
1035 http://rapanui.fr - Rapa Nui Easter Island. Misdetected.
1036 http://www.gif.ovh - autotranslated pages. Supposedly a GIF repository
1037 http://baladeornithologique.com - misdetection of the word "Retour"
1038 http://www.gaudry.be - misdetection of Japanese hiragana etc, and French "faire", as MRI
1039 http://www.gototahiti.net - probably misdetection, see title
1040 http://www.maraamusurfskirace.com - Bora Bora, French Polynesia. Misdetected.
1041 http://www.rongo-rongo.com - appears to be related to Easter Island. Just 1 sentence however.
1042 http://pt.city-usa.net - misdetection. Hawaii.
1043 https://www.manualscat.com - Misdetection. Appears to be in German. Manuals pages.
1044NL:
1045(!!!) - http://www.gouvernante.info and http://gouvernante.info - radio links to NZ websites not found by commoncrawl and which potentially have Maori language content. For example, http://irirangi.net/, https://www.atiawatoafm.com, www.maori.org.nz [http://www.gouvernante.info/radio4.htm]
1046- https://www.arrowhead.eu, https://arrowheadproject.azurewebsites.net, arrowhead.eu - misidentification of URL
- tonhut.nl - misidentification
1048? http://nielsonboutique.co.uk, http://longhornlaw.net, http://tetsubo.org, http://hidsonphoto.com, http://wearehomework.com/- Feels autotranslated, but no language options visible. All SEO related
1049- diverosa.com - Rapa Nui, Easter Island
1050- nonlinear.demon.nl - misidentified
1051- encyclo.co.uk - misidentification
1052- henrifloor.nl - misidentification
1053- http://skimap.info/ - maps, NZ placenames in PDF
1054DK:
1055!! ++ http://akona.ngapuhitelevision.com, http://waiatarangatiratanga.ngapuhitelevision.com,
1056http://jazz.ngapuhitelevision.com, http://ngapuhitelevision.com, http://ngapuhiradio.com,
1057http://powhiri.ngapuhitelevision.com, http://komisch.ngapuhitelevision.com
1058- http://www.rennertweb.de - a photogallery page mentioning NZ placenames
1059CA:
1060- http://bcmarina.com AND http://bckayak.com - photos with Canadian placenames
- http://www.myrasplace.net - pages of photos, captions involving NZ placenames
1062~ http://00.gs/Maniapoto;Uriwera;Moriori;Hivaoa;Kumulipo.htm - Maori-Polynesian comparative dictionary words listing
1063- aguadilla.airport-authority.com - misidentification
1064- https://articles.imperialtometric.com - misidentification
1065- http://daandehn.com - no more than 1 sentence over multiple files. Appears to be photo captions of NZ placenames
1066DE:
1067- http://etymologie.info/~e/n_/nz-___reg.html - placenames, not meaningful
!! https://www.cartogiraffe.com/ and https://www.cartogiraffe.com - some genuine pages (Rarotongan), but one page is in Czech and had a single word misidentified as MRI
1069~ http://svenkirsten.com/ - one page mentions "tiki" but the rest is in English. The other is an (English) caption of "Book of Tiki A Maori Maiden"
1070- herocity - autotranslated
1071- weltderberge.de - 3 pages mention NZ mountains by name.
1072~ (arts.mythologica.fr) https://mythologica.fr/oceanie/texte/pantheon_polynesien.pdf - mentions certain Maori Gods and other Polynesian Gods by name.
1073- https://traynews.com - nothing in MRI, misdetected
1074~ http://klaaskoehne.de/galleries/nzl-taranaki/index.html - mentions NZ mountain names
1075- http://www.nierstrasz.org/deGrauwRegister.rtf - misdetected European (Dutch) names as MRI
1076X https://afrikhepri.org/mi/ - autotranslated
1077- https://www.tvteile.de - pure German pages, misdetected "Automatik" as a Maori language word
1078- etoile-de-lune.net - 5 pages containing 1 sentence each but none with 2 sentences detected
1079- https://www.you-fly.com - misdetection of German "Warum?" as MRI
1080- http://vulkane.ch - misdetected pages on Hawaiian volcanoes.
1081- http://www.stephe.de - photos from NZ captioned with NZ placenames
1082- http://insecta.pro - misdetection
1083- http://m.distanta.1km.net - NZ placenames. Lots of distances mentioning Waitangi. Nothing detected as containing more than 1 sentence.
1084- https://ersatzteile-fachversand.de - German misdetected as Maori.
1085- https://laskar02cinta.page.tl/Info.htm - seems like a junk site with a random sentence autotranslated into many different languages. So one sentence possibly in Maori, but may not make sense.
1086- http://www.behlig.de - misdetection. Photos from Hawaii.
1087!! http://www.udhr.de - Universal Declaration of Human Rights. (Also on a Bulgarian site). Multiple translations available.
1088- ITALY:
1089 http://oipaz.net/IMG/GalleriaAotearoa/ - NZ photogallery with each photo captioned by placename
1090 http://www.marcosanti.it/Reportage/Oceania_ph/Nuova_Zelanda/ - each photo captioned by NZ placename
1091 http://www.pegasoesmicamion.com/ - REO abbreviation misidentified, also in REO%20PUBLICIDAD.htm
1092- AUSTRIA:
1093 petit-prince.at - Tahitian and Wayuu (Venezuela) translations of Le Petit Prince
1094 http://www.tmtmm.net/newzealand - photos from NZ named after places and people's names
1095- ROMANIA: parohiauceadesus.ro - Sentences of single Romanian words misidentified.
1096- ISRAEL:
1097 http://www.daat.ac.il - misidentification of "no." as MRI, and Hebrew words.
 https://www.hitiaotera.com/ - misidentification of Tahitian pages
1099- RUSSIA: https://www.gismeteo.lv - misidentification of an email address
1100- JAPAN: http://yutaka.it-n.jp - many pages of scientific names of (plants?) which are often misdetected as MRI
1101!! - Ireland, ie: https://coggle.it
1102- IRAN: https://www.dideo.ir/v/yt/d6cgya0ze-E - video title from MaoriTelevision website
1103- CZECH republic:
1104? https://www.fipojobs.com/new-zealand/jobs-work-p-1 - NZ job position title in MRI but rest in English
1105!! http://www.henryklahola.nazory.cz/094.Maori.htm and http://henryklahola.nazory.cz variant
1106 http://about.ilikeyou.com - dating site. Misidentification.
1107- SPAIN:
1108!! https://www.uv.es/~pla/red.net/intmaori.html
1109 https://www.reclamaciondevuelos.com - 2 occurrences of the word "kiwi"
1110 http://www.info-hoteles.com/nz/2/hotels_lake_rotoiti.asp - 2 uses of the same placename
1111 http://www.cruceros-princess.mx/princessMX/Oferta_Cruzeiros_Polinesia.html - Polynesian placenames
1112- SINGAPORE: https://omg-solutions.com - autotranslated
1113- TURKEY: https://www.elitedeluxe.com.tr/mi/yatak-odasi-takimlari - autotranslated
1114- MEXICO: http://www.gelbukh.com - misidentification, lines of just numbers or phrases like "Area Chair" in English and Russian CVs.
1115- FINLAND: http://pertti.com - travelogue, placenames
1116- SWITZERLAND CH:
1117 nicoledidi.ch - blog, placenames
1118 https://photos.axelebert.org - Tahiti related content
1119- UNKNOWN: https://www.viveipcl.com: tours website, placenames mentioned
1120#- EU: https://www.the-good-stuff-factory.be/mi/ : Autotranslated
1121!! - BULGARIA: http://anitra.net/activism/humanrights/UDHR/rrt_print.htm (2 pages)
1122
1123
1124TREATING AUSTRALIA AND GREAT BRITAIN MORE SPECIALLY (don't ignore /mi in URL, same as with NZ, but do leave out .nz TLDs since we cover them under NZ - TODO: later find country codes of all .nz TLDs):
1125[nothing found under "UK", only under "GB"]
1126
1127db.getCollection('Websites').find({
1128 domain: {$not: /.nz$/},
1129 numPagesContainingMRI: {$gt: 0},
1130 $or: [{geoLocationCountryCode: "AU"}, {geoLocationCountryCode: "GB"}]
1131}).count()
11
1133
1134db.Websites.aggregate([
1135 {
1136 $match: {
1137 domain: {$not: /.nz$/},
1138 numPagesContainingMRI: {$gt: 0},
1139 $or: [{geoLocationCountryCode: "AU"}, {geoLocationCountryCode: "GB"}]
1140 }
1141 },
1142 { $unwind: "$geoLocationCountryCode" },
1143 {
1144 $group: {
1145 _id: {$toLower: '$geoLocationCountryCode'},
1146 count: { $sum: 1 },
1147 domain: { $addToSet: '$domain' }
1148 }
1149 },
1150 { $sort : { count : -1} }
1151]);
1152
1153AUSTRALIA:
1154!! https://www.kiwiproperty.com - e.g. https://www.kiwiproperty.com/the-base/mi/he-paepaki/ has some actual MRI sentences. [Not autotranslated]
1155? http://fionajack.net - Wellington gallery of artist. A few occurrences of Kia Ora in a title like context (e.g. "Street Party Kia Ora! Kia Ora!")
1156!! https://infogram.com/te-marautanga-o-aotearoa-moe-pld-allocations-2012-1go502ygvn562jd - site of individual pages (like docs.google.com). This one has a relevant infogram image.
1157!! https://koreromaori.com - some actual Maori language sentences
1158 http://theunderwaterworld.com/Galleries/Roimata/roimata-frame.html - placenames
1159
1160UK:
1161 http://www.wordsearchfun.com/200628_Word_Find_wordsearch.html - 2 word games with Maori words (one of them has 3 different views, e.g. print view)
1162? https://omniatlas.com/maps/australasia/18400206/plain/ - historical map with Maori iwi names over NZ map regions
1163? https://omniatlas.com/maps/australasia/18400206/ - historical map of Australia and NZ at the time of the Treaty of Waitangi, with events marked in English
1164 https://centrallanguageschool.com - AUTOTRANSLATED
1165 https://www.solasolv.com - Autotranslated product site
1166 http://mikestephens.co.uk/ - photo captions containing NZ placenames
1167 http://www.woolrych.org/nzholiday2004/ - photogallery captioned with NZ placenames
1168
1169--------------
1170
1171GETTING TABLE DATA OUT OF MONGO DB:
1172
1173https://stackoverflow.com/questions/28733692/how-to-export-json-from-mongodb-using-robomongo
1174"export to file" as in a spreadsheet like to a .csv?
1175
1176IMO this is the EASIEST way to do this in Robo 3T (formerly robomongo):
1177
1178 1. In the top right of the Robo 3T GUI there is a "View Results in text mode" button, click it and copy everything
1179
1180 2. paste everything into this website: https://json-csv.com/
1181
1182 3. click the download button and now you have it in a spreadsheet.
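
As an alternative to pasting into a website, aggregate results of the shape shown earlier ({_id, count, domain: [...]}) could be flattened into CSV text with a few lines of plain JavaScript. This is only a sketch: toCsv and csvField are made-up helper names, and the column headings are an arbitrary choice.

```javascript
// Sketch only: turns aggregate results shaped like {_id, count, domain: [...]}
// into CSV text, one row per domain. toCsv and csvField are made-up helper names.
function csvField(value) {
  const text = String(value);
  // Quote fields containing commas, quotes or newlines; double embedded quotes.
  return /[",\n]/.test(text) ? '"' + text.replace(/"/g, '""') + '"' : text;
}

function toCsv(results) {
  const rows = [["countryCode", "count", "domain"]];
  for (const group of results) {
    for (const domain of group.domain) {
      rows.push([group._id, group.count, domain]);
    }
  }
  return rows.map(row => row.map(csvField).join(",")).join("\n");
}

// Two of the result documents listed earlier in these notes.
const results = [
  { _id: "ru", count: 1, domain: ["http://www.treningmozga.com"] },
  { _id: "de", count: 1, domain: ["http://www.almancax.com"] }
];

const csv = toCsv(results);
console.log(csv);
```

The resulting text can be saved with a .csv extension and opened in a spreadsheet.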
1183
1184
1186
1187
1188---------------------
1189
Count of websites that have at least 1 page containing at least one sentence detected as MRI
AND which have mi in the URL path:

db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}]}).count()

491
1196
1197
1198
# The websites that have some MRI detected AND which are either in NZ or have a .nz TLD
# or (if they're from overseas) don't contain /mi or mi.* in the URL path:

db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{$or: [{geoLocationCountryCode: "NZ"}, {domain: /\.nz$/}, {urlContainsLangCodeInPath: false}]}]}).count()
396
1204
1205Include Australia (to get the valid "kiwiproperty.com" website included in the result list):
1206
1207db.getCollection('Websites').find({$and: [
1208 {numPagesContainingMRI: {$gt: 0}},
1209 {$or: [{geoLocationCountryCode: /(NZ|AU)/}, {domain: /\.nz$/}, {urlContainsLangCodeInPath: false}]}
1210 ]}).count()
1211
397
1213
1214# aggregate results by a count of country codes
1215db.Websites.aggregate([
1216 {
1217 $match: {
1218 $and: [
1219 {numPagesContainingMRI: {$gt: 0}},
1220 {$or: [{geoLocationCountryCode: /(NZ|AU)/}, {domain: /\.nz$/}, {urlContainsLangCodeInPath: false}]}
1221 ]
1222 }
1223 },
1224 { $unwind: "$geoLocationCountryCode" },
1225 {
1226 $group: {
1227 _id: {$toLower: '$geoLocationCountryCode'},
1228 count: { $sum: 1 }
1229 }
1230 },
1231 { $sort : { count : -1} }
1232]);
1233
1234
# Just considering those sites that are neither from NZ nor have a .nz TLD:
1236db.Websites.aggregate([
1237 {
1238 $match: {
1239 $and: [
1240 {geoLocationCountryCode: {$ne: "NZ"}},
1241 {domain: {$not: /\.nz/}},
1242 {numPagesContainingMRI: {$gt: 0}},
1243 {$or: [{geoLocationCountryCode: "AU"}, {urlContainsLangCodeInPath: false}]}
1244 ]
1245 }
1246 },
1247 { $unwind: "$geoLocationCountryCode" },
1248 {
1249 $group: {
1250 _id: {$toLower: '$geoLocationCountryCode'},
1251 count: { $sum: 1 },
1252 domain: { $addToSet: '$domain' }
1253 }
1254 },
1255 { $sort : { count : -1} }
1256]);
1257
1258
# total count of the above, i.e. excluding NZ related sites
1260db.getCollection('Websites').find({$and: [
1261 {geoLocationCountryCode: {$ne: "NZ"}},
1262 {domain: {$not: /\.nz/}},
1263 {numPagesContainingMRI: {$gt: 0}},
1264 {$or: [{geoLocationCountryCode: "AU"}, {urlContainsLangCodeInPath: false}]}
1265 ]}).count()
1266
221 websites
1268
1269
1270# But to produce the tentative non-product sites, we also want the aggregate for all NZ sites (from NZ or with .nz tld):
1271db.getCollection('Websites').find({$and: [
1272 {numPagesContainingMRI: {$gt: 0}},
1273 {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]}
1274 ]}).count()
1275
176
1277
1278(Total is 221+176 = 397, which adds up).
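
Note that these notes use three different patterns for the .nz test: /.nz$/, /\.nz$/ and /\.nz/. They do not match the same strings, as the plain JavaScript sketch below shows. Two of the sample domains are taken from these notes; the third is made up. Whether the difference affects any of the recorded counts has not been checked here.

```javascript
// Sketch only: compares the three .nz patterns used in these notes.
const patterns = {
  unescaped: /.nz$/,   // "." matches any character, so "enz" at the end matches too
  anchored: /\.nz$/,   // literal ".nz" at the end of the string
  unanchored: /\.nz/   // literal ".nz" anywhere in the string
};

const domains = [
  "https://admin.teara.govt.nz",  // from these notes: genuine .nz TLD
  "http://community.nzdl.org",    // from these notes: ".nz" mid-string, .org TLD
  "http://www.example-benz"       // made-up string ending in "enz"
];

const matches = {};
for (const [name, pattern] of Object.entries(patterns)) {
  matches[name] = domains.filter(d => pattern.test(d));
}
console.log(matches);
```

Only the anchored, escaped form /\.nz$/ restricts the match to domains that actually end in ".nz".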
1279
# Get the count (and domain listing) output grouped under a hardcoded _id of "nz":
1281db.Websites.aggregate([
1282 {
1283 $match: {
1284 $and: [
1285 {numPagesContainingMRI: {$gt: 0}},
1286 {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]}
1287 ]
1288 }
1289 },
1290 { $unwind: "$geoLocationCountryCode" },
1291 {
1292 $group: {
1293 _id: "nz",
1294 count: { $sum: 1 },
1295 domain: { $addToSet: '$domain' }
1296 }
1297 },
1298 { $sort : { count : -1} }
1299]);
1300
1301
1302-----------------------
1303US:
1304Done: manually inspected 68/117 sites
1305
1306TOTAL US: 4+7+7+4+3=25
1307
1308DEFINITELY:
1309+ http://anglicanhistory.org,
1310+ http://www.unicode.org, [Universal declaration of Human Rights]
1311+ https://static-promote.weebly.com,
1312+ http://aclhokiangarocks.blogspot.com, [often English, but COMMUNITY]
1313
1314BIBLE/MOHAMMED/BAHAI TRANSLATIONS probably not auto translations:
1315+ http://bahaiprayers.net, [Dutch seems to be properly translated, not auto-translated, so maybe MRI too]
1316+ https://biblehub.com,
1317+ http://www.muhammad.com, [possibly not autotranslated]
1318+ http://www.godrules.net, [possibly not autotranslated]
1319+ http://m.biblepub.com,
1320+ http://www.krassotkin.ru, [probably real translations, as there are multiple Dutch translations from different sources provided]
1321+ http://www.gotquestions.org, [doesn't appear autotranslated]
1322X https://ebible.org, [Hiri Motu, PNG language misdetected. Doesn't seem to have Maori]
1323X https://www.bible.com, doesn't have Maori translation. Misdetected.
1324X https://wol.jw.org, - doesn't have Maori translations. Instead, Rongo-rongo, Kiribati (Micronesian) etc misdetected
1325X https://png.bible, [misdetected, Papua New Guinea]
1326X http://www.precious-testimonies.com, http://precious-testimonies.com/JesusDidItTranslations/JesusDidItMaoriTranslation.htm may be autotranslated as the Dutch page looks more like Danish or some Scandinavian language and the French page is missing accented characters.
1327
1328CHECK - PROBABLY:
1329!! https://maorinews.com,
1330!! http://maaori.com,
1331!!+ http://kiaorahola.blogspot.com/
1332+ https://kjohnsonnz.blogspot.com,
1333+ http://pumanawawhangara.blogspot.com,
1334+ http://dannykahei.tripod.com,
1335+ http://burkekm001.tripod.com
1336+ http://tkkpipipaopao.blogspot.com,
1337+ http://manateina.blogspot.com,
1338? tkkpipipaopao.blogspot.com? http://rangiwewehi.com, [English, but community]
1339? https://www.terakau.org, [COMMUNITY, but English]
1340? https://www.pipirikiapapatuanuku.org, [COMMUNITY?, in English, environment site]
1341~ http://georgegi.tripod.com,
1342~ http://ngarangatahi.tripod.com, [1 page, image caption, Maori language warden position title with English sentence for appointment as warden]
1343X http://fhr.kiwicelts.com,
1344X http://tkrow.tripod.com, [English, background of NZ place]
1345X http://www.mkiwi.com, - placenames
1346X http://www.waimate.com, [English, NZ place]
1347
1348MAYBE, INSPECT:
1349? https://www.natekore2018.com, [lots of English, but COMMUNITY, CULTURE]
1350+ http://tatai09.blogspot.com,
1351+ http://www.twttoa.com,
1352+ http://tuhua2010.blogspot.com,
1353X http://www.huapala.org, [misdetected, Hawaiian]
1354X https://www.vaihaunui.net, [misdetected, Tahiti]
1355X https://www.kaifineart.com, [art site by different artists. A Chinese and another (possibly Japanese) name were misdetected]
X http://mahoraroom8.blogspot.com, [NZ school, but main page mostly in English. No pages with > 1 sentence detected as MRI]
1357+ http://piripi.blogspot.com,
1358X http://www.hiroa.pf, [misdetected. Crawled content appears Polynesian not Maori]
1359X http://korora.econ.yale.edu, [NZ place photo caption]
1360X https://www.poehalisnami.ua, [mostly Cyrillic, with some NZ or Polynesian names misdetected]
1361X http://hannas-reiseblog.blogspot.com - one page contained NZ placenames, another had a word misdetected
1362
1363
1364+ https://www.breaker.audio, [audio, with occasional English.]
1365? https://livestream.com, [video and audio, seems in English, but maybe CULTURAL/COMMUNITY?]
1366
1367X https://docs.google.com, timetable with occasional Maori language word
1368+ https://drive.google.com, https://drive.google.com/file/d/1NwuzafjddaP8gxI7O_Zapts5bM7mrtwn/preview is an image of Maori number names. But other page on drive.google.com is a NZ certificate or ID (in English) of a person's position.
1369http://ritusehji.blogspot.com - no page with more than 1 sentence detected. But short string of actual MRI content. Educator blog with pictures and English language content.
1370
1371
1372PINTEREST
1373+ https://in.pinterest.com/pin/317363104978423418/
1374 "karakia mo te moana - Google Search | Te Reo Maori Resources | Moana, Powerpoint tips, Google"
1375? https://za.pinterest.com/pin/524669425310419500/
1376 Maori Moko | Image | Moko Maori Tattoo & Portraits | TA MOKO | Maori tribe, Maori people, Maori art [COMMUNITY, CULTURE]
1377[The other pinterest detected as numPagesContainingMRI > 0 was misdetected]
1378
1379https://nl.pinterest.com,
1380https://www.pinterest.jp,
1381https://www.pinterest.it,
1382https://www.pinterest.co.uk,
1383https://www.pinterest.ca,
1384https://za.pinterest.com,
1385https://www.pinterest.fr,
1386https://in.pinterest.com,
1387
1388MORE BLOGSPOTS
1389X http://word-dialect.blogspot.com, [Indonesian, misdetected]
1390~ http://atopeconlostopes.blogspot.com, [title on page appears to be in MRI, but content appears to be in English and South/Central American. Internationally focussed content.]
1391X http://lianzaconference2012.blogspot.com, [NZ placename or institution]
1392? http://mrshamiltonskoolkidz.blogspot.com, [te reo Maori related school activities. Described in English.]
1393X http://capsuraotearoa.blogspot.com, [blog in French, photo captions contain NZ placenames]
1394X http://blogdepasopor.blogspot.com, [blog in French, Rapa Nui/Easter Island related content, misdetected.]
1395
1396
1397UNLIKELY
1398?? http://naturalfatburner.net, http://naturalfatburner.net/NoNonsenseTed/fatloss-mao/ feels like it's autotranslated, an image of text appears, but the text is in MRI [advertising for some weight loss gimmick]
1399
1400
1401BLACKLIST:
1402X http://ww25.milfsplease.com,
1403X http://www.the-naked.com
1404
1405OTHER:
1406X http://seapixonline.com, https://www.seapixonline.com, [photo captions of ships. Sometimes misdetected Japanese words as MRI.]
1407X http://www.code-postal.com, https://www.code-postal.com, [not more than 1 sentence detected as in MRI]
1408X https://www.dbnames.net, [Name database, lots misdetected]
1409
1410STILL TO DO LIST:
1411
1412X https://www.myadsclassified.com, [misdetected 3 short English sentences as MRI]
1413X http://www.whoisthatr.com, [misdetected short English sentence as MRI]
1414X https://www.oemsec.com, [autotranslated product site]
1415X http://svenskadress.net, [linkfarm like site of related junk links, contained URLs misdetected as MRI]
1416
1417X https://www.webwiki.com, [contains URLs. URLs containing Aotearoa as substring detected as MRI. But no proper sentence content. ]
1418X http://mikebonnice.com, [Hawaiian and Tahiti related content misdetected]
1419X http://www.hudl.com, [misdetected short English sentence as MRI]
1420X http://www.wikitree.com, [misdetected short English sentence as MRI]
1421X http://shuttersportnelson.photoshelter.com, [image captions of "Wairua Warrior"]
1422
1423X http://niken8media.logdown.com, [Poker website? Looks autotranslated or Lorem Ipsum type of meaningless sentences.]
1424X https://www.podrozeady.com, Looks Polish or other East-European language. The NZ page https://www.podrozeady.com/NZ/4/ had placenames detected.
1425
1426X http://www.thesalmons.org, [detection and misdetection of author names of papers hosted]
1427
1428X http://linkvip.top, [.rar and media file links misdetected as MRI]
1429
1430
1431X http://www.lunar-occultations.com, [NZ place names for astronomical phenomena]
1432X http://shangrilapress.net, [NZ placenames]
1433X http://malecek.com, [misdetection CD title]
1434X https://www.blue-frontiers.com, [Tahitian, Reo Tahiti, misdetected as MRI]
1435X http://www.whoisentry.com, [URL names, looked at several which were probably misdetected as MRI]
1436X http://loquevendra318.com, [uses Google translate for auto-translation]
1437
1438
1439?? http://www.forensicfashion.com, [historical information, useful for CULTURE? e.g. http://www.forensicfashion.com/1807MaoriChief.html]
1440
1441X http://www.eyecontactsite.com, [Lots of names. And a few short sentences or words possibly in comments.]
1442X http://eartheum.com, [Rapa Nui, Easter Island related content. Misdetected]
1443X http://www.steve-wheeler.co.uk, [Blogspot. Title of a single page is in Maori. "Aotearoa ... kei te aroha au ki a koe"]
1444X https://chromium.googlesource.com, [some source code related to languages' two letter codes]
1445
1446X http://www.roadsmile.com, [Lots of misdetection based on word Kia.]
1447?? https://www.knowatom.com, https://phet.colorado.edu [Similar looking science web sites for children. Uses auto-translation?]
1448
1449X https://www.indexmundi.com, [place names. Pages about Solomon Islands. Misdetection of placenames.]
1450
1451
1452
1453X http://wowwars.net, [Has a page on Kia Kaha meaning, but URL redirects to a different low quality site with bad formatting and adverts. ]
1454?? https://www.hidroponia.org.mx, [Not sure if https://www.hidroponia.org.mx/index.php/idiomas/284-hydroponics-te-ahurea-wai-maori is autotranslated or not. Can't easily locate existence of Dutch or German translated pages. There's Tamil-Singapore, but no other Tamil. So maybe translations based on target buyer audience?]
1455X http://www.v3whois.com, [URLs are misdetected as MRI]
X http://rhymebrain.com, [appears to have misdetected a short phrase of 2 words, Kai Kaia, besides phrase words from other languages]
1457
1458
1459X SINGLE SENTENCE DETECTED (NO MORE AND NOT PAGE:)
1460 http://frontrowphotos.com,
1461 http://www.pressreader.com,
1462 https://www.nccri.ie,
1463 http://takethatvacation.com,
1464 http://worldradiomap.com,
1465 http://www.namesdir.com,
1466
1467 X http://www.frogsonline.com, [NZ hotels, placenames]
1468 X http://www.geni.com, [Single sentence misdetection]
1469 X http://wikiedit.org, [just a list of lots of words, possibly placenames. Some misdetected, e.g. Rapa Nui]
1470
1471
1472
1473---------------
1474
1475MANUAL - TOTAL NUM SITES WITH SOME MRI CONTENT BY COUNTRY
1476NZ: 176
1477US: 25
1478AU: 3
1479FR: 1
1480DK: 2
1481(CA: 0.5)
1482DE: 2
1483IE (Ireland): 1
1484CZ: 1
1485ES: 1
1486BG: 1
1487
1488TIDIED:
1489NZ: 176
1490US: 25
1491AU: 3
1492DE: 2
1493DK: 2
1494BG: 1
1495CZ: 1
1496ES: 1
1497FR: 1
1498IE: 1
1499TOTAL: 213
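
The total can be re-checked by adding up the tidied tallies. A plain JavaScript sketch that parses the "CC: n" lines above and sums them:

```javascript
// Sketch only: re-adds the tidied per-country tallies listed above.
const tidied = `NZ: 176
US: 25
AU: 3
DE: 2
DK: 2
BG: 1
CZ: 1
ES: 1
FR: 1
IE: 1`;

const total = tidied
  .split("\n")
  .map(line => Number(line.split(":")[1])) // take the number after the colon
  .reduce((sum, n) => sum + n, 0);

console.log("TOTAL:", total);
```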
1500
1501