[33644] | 1 | MongoDB
|
---|
| 2 | Installation:
|
---|
| 3 | https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/
|
---|
| 4 | https://docs.mongodb.com/manual/administration/install-on-linux/
|
---|
| 5 | https://hevodata.com/blog/install-mongodb-on-ubuntu/
|
---|
| 6 | https://www.digitalocean.com/community/tutorials/how-to-install-mongodb-on-ubuntu-16-04
|
---|
| 7 | CENTOS (Analytics): https://tecadmin.net/install-mongodb-on-centos/
|
---|
| 8 | FROM SOURCE: https://github.com/mongodb/mongo/wiki/Build-Mongodb-From-Source
|
---|
| 9 | GUI:
|
---|
| 10 | https://robomongo.org/
|
---|
| 11 | Robomongo is Robo 3T now
|
---|
| 12 |
|
---|
| 13 | https://www.tutorialspoint.com/mongodb/mongodb_java.htm
|
---|
| 14 | JAR FILE:
|
---|
| 15 | http://central.maven.org/maven2/org/mongodb/mongo-java-driver/
|
---|
| 16 | https://mongodb.github.io/mongo-java-driver/
|
---|
| 17 |
|
---|
| 18 |
|
---|
| 19 |
|
---|
| 20 | https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/
|
---|
| 21 | http://www.programmersought.com/article/6500308940/
|
---|
| 22 |
|
---|
| 23 | 52 sudo apt-get install mongodb-clients
|
---|
| 24 | 53 mongo 'mongodb://mongodb.cms.waikato.ac.nz:27017' -u anupama -p
|
---|
| 25 |
|
---|
| 26 | Failed with
|
---|
| 27 | Error: HostAndPort: host is empty at src/mongo/shell/mongo.js:148
|
---|
| 28 | exception: connect failed
|
---|
| 29 |
|
---|
| 30 | This is due to a version incompatibility between the mongo client and the MongoDB server.
|
---|
| 31 | The solution is to follow instructions at http://www.programmersought.com/article/6500308940/
|
---|
| 32 | and then https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/
|
---|
| 33 | as below:
|
---|
| 34 |
|
---|
| 35 | 54 sudo apt-get purge mongodb-clients
|
---|
| 36 | 55 sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 9DA31620334BD75D9DCB49F368818C72E52529D4
|
---|
| 37 | 56 echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu xenial/mongodb-org/4.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.0.list
|
---|
| 38 | 57 sudo apt-get update
|
---|
| 39 | 58 sudo apt-get install mongodb-clients
|
---|
| 40 | 59 mongo 'mongodb://mongodb.cms.waikato.ac.nz:27017' -u anupama -p
|
---|
| 41 | (still doesn't work)
|
---|
| 42 | 60 sudo apt-get install -y mongodb-org
|
---|
| 43 | The above installs an up-to-date mongo client, but it installs the MongoDB server too. Possibly this is the only step needed to install both an up-to-date mongo client and the MongoDB server?
|
---|
| 44 | 72 sudo service mongod status
|
---|
| 45 |
|
---|
| 46 | 103 sudo service mongod start
|
---|
| 47 | "mongod" stands for mongo daemon. It runs the MongoDB server, which listens for client connections.
|
---|
| 48 | 104 sudo service mongod status
|
---|
| 49 | 88 sudo service mongod stop
|
---|
| 50 |
|
---|
| 51 |
|
---|
| 52 | DETAILS:
|
---|
| 53 |
|
---|
| 54 | wharariki:[879]/Scratch/ak19/gs3-extensions/maori-lang-detection>mongo 'mongodb://mongodb.cms.waikato.ac.nz:27017' -u anupama -p
|
---|
| 55 |
|
---|
| 56 | This didn't work with the password. It failed with:
|
---|
| 57 |
|
---|
| 58 | MongoDB shell version: 2.6.10
|
---|
| 59 | Enter password:
|
---|
| 60 | connecting to: mongodb://mongodb.cms.waikato.ac.nz:27017
|
---|
| 61 | 2019-11-04T20:02:47.970+1300 Assertion: 13110:HostAndPort: host is empty
|
---|
| 62 | 2019-11-04T20:02:47.970+1300 0x6b75c9 0x659e9f 0x636f69 0x4fa55c 0x501249 0x4fa7f1 0x6006fd 0x5eb869 0x7f7bfbd47d76 0x1f3c10d06362
|
---|
| 63 | mongo(_ZN5mongo15printStackTraceERSo+0x39) [0x6b75c9]
|
---|
| 64 | mongo(_ZN5mongo10logContextEPKc+0x21f) [0x659e9f]
|
---|
| 65 | mongo(_ZN5mongo11msgassertedEiPKc+0xd9) [0x636f69]
|
---|
| 66 | mongo(_ZN5mongo16ConnectionString12_fillServersENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x50c) [0x4fa55c]
|
---|
| 67 | mongo(_ZN5mongo16ConnectionStringC1ENS0_14ConnectionTypeERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES9_+0x99) [0x501249]
|
---|
| 68 | mongo(_ZN5mongo16ConnectionString5parseERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERS6_+0x201) [0x4fa7f1]
|
---|
| 69 | mongo(_ZN5mongo17mongoConsExternalEPNS_7V8ScopeERKN2v89ArgumentsE+0x11d) [0x6006fd]
|
---|
| 70 | mongo(_ZN5mongo7V8Scope10v8CallbackERKN2v89ArgumentsE+0xa9) [0x5eb869]
|
---|
| 71 | /usr/lib/libv8.so.3.14.5(+0x99d76) [0x7f7bfbd47d76]
|
---|
| 72 | [0x1f3c10d06362]
|
---|
| 73 | 2019-11-04T20:02:47.971+1300 Error: HostAndPort: host is empty at src/mongo/shell/mongo.js:148
|
---|
| 74 | exception: connect failed
|
---|
| 75 |
|
---|
| 76 |
|
---|
| 77 | This is due to a version incompatibility between the mongo client and the MongoDB server.
|
---|
| 78 | The client version is shown in the output above (2.6.10).
|
---|
| 79 | The server version can be found from within the mongo client shell. Start the shell without connecting to a db:
|
---|
| 80 |
|
---|
| 81 |
|
---|
| 82 | wharariki:[880]/Scratch/ak19/gs3-extensions/maori-lang-detection>mongo --shell -nodb
|
---|
| 83 | MongoDB shell version: 2.6.10 <<<<<<<<<-------------------<<<< MONGO CLIENT VERSION
|
---|
| 84 | type "help" for help
|
---|
| 85 | > help
|
---|
| 86 | db.help() help on db methods
|
---|
| 87 | db.mycoll.help() help on collection methods
|
---|
| 88 | sh.help() sharding helpers
|
---|
| 89 | rs.help() replica set helpers
|
---|
| 90 | help admin administrative help
|
---|
| 91 | help connect connecting to a db help
|
---|
| 92 | help keys key shortcuts
|
---|
| 93 | help misc misc things to know
|
---|
| 94 | help mr mapreduce
|
---|
| 95 |
|
---|
| 96 | show dbs show database names
|
---|
| 97 | show collections show collections in current database
|
---|
| 98 | show users show users in current database
|
---|
| 99 | show profile show most recent system.profile entries with time >= 1ms
|
---|
| 100 | show logs show the accessible logger names
|
---|
| 101 | show log [name] prints out the last segment of log in memory, 'global' is default
|
---|
| 102 | use <db_name> set current database
|
---|
| 103 | db.foo.find() list objects in collection foo
|
---|
| 104 | db.foo.find( { a : 1 } ) list objects in foo where a == 1
|
---|
| 105 | it result of the last line evaluated; use to further iterate
|
---|
| 106 | DBQuery.shellBatchSize = x set default number of items to display on shell
|
---|
| 107 | exit quit the mongo shell
|
---|
| 108 |
|
---|
| 109 | > help connect
|
---|
| 110 |
|
---|
| 111 | Normally one specifies the server on the mongo shell command line. Run mongo --help to see those options.
|
---|
| 112 | Additional connections may be opened:
|
---|
| 113 |
|
---|
| 114 | var x = new Mongo('host[:port]');
|
---|
| 115 | var mydb = x.getDB('mydb');
|
---|
| 116 | or
|
---|
| 117 | var mydb = connect('host[:port]/mydb');
|
---|
| 118 |
|
---|
| 119 | Note: the REPL prompt only auto-reports getLastError() for the shell command line connection.
|
---|
| 120 |
|
---|
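| 121 | Connecting from within the shell. Note that mydb.connect.help() below prints the generic DBCollection help (for a collection named "connect"), not help on connect options: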
|
---|
| 122 |
|
---|
| 123 | > var x = new Mongo('mongodb.cms.waikato.ac.nz:27017');
|
---|
| 124 | > var mydb = x.getDB('anupama');
|
---|
| 125 |
|
---|
| 126 | > mydb.connect.help()
|
---|
| 127 | DBCollection help
|
---|
| 128 | db.connect.find().help() - show DBCursor help
|
---|
| 129 | db.connect.count()
|
---|
| 130 | db.connect.copyTo(newColl) - duplicates collection by copying all documents to newColl; no indexes are copied.
|
---|
| 131 | db.connect.convertToCapped(maxBytes) - calls {convertToCapped:'connect', size:maxBytes}} command
|
---|
| 132 | db.connect.dataSize()
|
---|
| 133 | db.connect.distinct( key ) - e.g. db.connect.distinct( 'x' )
|
---|
| 134 | db.connect.drop() drop the collection
|
---|
| 135 | db.connect.dropIndex(index) - e.g. db.connect.dropIndex( "indexName" ) or db.connect.dropIndex( { "indexKey" : 1 } )
|
---|
| 136 | db.connect.dropIndexes()
|
---|
| 137 | db.connect.ensureIndex(keypattern[,options]) - options is an object with these possible fields: name, unique, dropDups
|
---|
| 138 | db.connect.reIndex()
|
---|
| 139 | db.connect.find([query],[fields]) - query is an optional query filter. fields is optional set of fields to return.
|
---|
| 140 | e.g. db.connect.find( {x:77} , {name:1, x:1} )
|
---|
| 141 | db.connect.find(...).count()
|
---|
| 142 | db.connect.find(...).limit(n)
|
---|
| 143 | db.connect.find(...).skip(n)
|
---|
| 144 | db.connect.find(...).sort(...)
|
---|
| 145 | db.connect.findOne([query])
|
---|
| 146 | db.connect.findAndModify( { update : ... , remove : bool [, query: {}, sort: {}, 'new': false] } )
|
---|
| 147 | db.connect.getDB() get DB object associated with collection
|
---|
| 148 | db.connect.getPlanCache() get query plan cache associated with collection
|
---|
| 149 | db.connect.getIndexes()
|
---|
| 150 | db.connect.group( { key : ..., initial: ..., reduce : ...[, cond: ...] } )
|
---|
| 151 | db.connect.insert(obj)
|
---|
| 152 | db.connect.mapReduce( mapFunction , reduceFunction , <optional params> )
|
---|
| 153 | db.connect.aggregate( [pipeline], <optional params> ) - performs an aggregation on a collection; returns a cursor
|
---|
| 154 | db.connect.remove(query)
|
---|
| 155 | db.connect.renameCollection( newName , <dropTarget> ) renames the collection.
|
---|
| 156 | db.connect.runCommand( name , <options> ) runs a db command with the given name where the first param is the collection name
|
---|
| 157 | db.connect.save(obj)
|
---|
| 158 | db.connect.stats()
|
---|
| 159 | db.connect.storageSize() - includes free space allocated to this collection
|
---|
| 160 | db.connect.totalIndexSize() - size in bytes of all the indexes
|
---|
| 161 | db.connect.totalSize() - storage allocated for all data and indexes
|
---|
| 162 | db.connect.update(query, object[, upsert_bool, multi_bool]) - instead of two flags, you can pass an object with fields: upsert, multi
|
---|
| 163 | db.connect.validate( <full> ) - SLOW
|
---|
| 164 | db.connect.getShardVersion() - only for use with sharding
|
---|
| 165 | db.connect.getShardDistribution() - prints statistics about data distribution in the cluster
|
---|
| 166 | db.connect.getSplitKeysForChunks( <maxChunkSize> ) - calculates split points over all chunks and returns splitter function
|
---|
| 167 | db.connect.getWriteConcern() - returns the write concern used for any operations on this collection, inherited from server/db if set
|
---|
| 168 | db.connect.setWriteConcern( <write concern doc> ) - sets the write concern for writes to the collection
|
---|
| 169 | db.connect.unsetWriteConcern( <write concern doc> ) - unsets the write concern for writes to the collection
|
---|
| 170 | > mydb.version()
|
---|
| 171 | 4.0.13 <<<<<<<<<-------------------<<<< MONGODB SERVER VERSION
|
---|
| 172 |
|
---|
| 173 | (Check Mongo server version: https://stackoverflow.com/questions/38160412/how-to-find-the-exact-version-of-installed-mongodb)
|
---|
| 174 |
|
---|
| 175 | We now finally know the MongoDB server version: 4.0.13.
|
---|
| 176 | This server version doesn't work with our mongo client (shell) version, 2.6.10.
|
---|
| 177 |
|
---|
| 178 |
|
---|
| 179 | DETAILS OF INSTALLING MONGO-CLIENT AND UPDATING IT, AND INSTALLING MONGODB SERVER:
|
---|
| 180 |
|
---|
| 181 |
|
---|
| 182 | 54 sudo apt-get purge mongodb-clients
|
---|
| 183 | 55 sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 9DA31620334BD75D9DCB49F368818C72E52529D4
|
---|
| 184 | 56 echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu xenial/mongodb-org/4.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.0.list
|
---|
| 185 | 57 sudo apt-get update
|
---|
| 186 | 58 sudo apt-get install mongodb-clients
|
---|
| 187 | 59 mongo 'mongodb://mongodb.cms.waikato.ac.nz:27017' -u anupama -p
|
---|
| 188 | 60 sudo apt-get install -y mongodb-org
|
---|
| 189 | 61 mongo 'mongodb://mongodb.cms.waikato.ac.nz:27017' -u anupama -p
|
---|
| 190 | 62 sudo service apache2 status
|
---|
| 191 | 63 sudo service sshd status
|
---|
| 192 | 64 sudo service mongodb status
|
---|
| 193 | 65 sudo service mongo status
|
---|
| 194 | 66 mongod
|
---|
| 195 | 67 mongod --help
|
---|
| 196 | 68 mongod --help | less
|
---|
| 197 | 69 mongod -f /etc/mongod.conf
|
---|
| 198 | 70 sudo mongod -f /etc/mongod.conf
|
---|
| 199 | 71 less /etc/mongod.conf
|
---|
| 200 | 72 sudo service mongod status
|
---|
| 201 | 73 sudo service mongod start
|
---|
| 202 | 74 sudo service mongod status
|
---|
| 203 | 75 ls -l /var/log/mongodb/mongod.log
|
---|
| 204 | 76 sudo rm /var/log/mongodb/mongod.log
|
---|
| 205 | 77 sudo service mongod status
|
---|
| 206 | 78 sudo service mongod start
|
---|
| 207 | 79 sudo service mongod status
|
---|
| 208 | 80 sudo service mongod stop
|
---|
| 209 | 81 ps auxww | grep mongo
|
---|
| 210 | 82 sudo service mongod start
|
---|
| 211 | 83 sudo service mongod status
|
---|
| 212 | 84 ps auxww | grep mongo
|
---|
| 213 | 85 sudo dmsg
|
---|
| 214 | 86 sudo dmesg
|
---|
| 215 | 87 sudo service mongod status
|
---|
| 216 | 88 sudo service mongod stop
|
---|
| 217 | 89 sudo service mongod start
|
---|
| 218 | 90 sudo dmesg
|
---|
| 219 | 91 sudo less /var/log/mongodb/mongod.log
|
---|
| 220 | 92 ls /var/lib/
|
---|
| 221 | 93 ls -ld /var/lib/
|
---|
| 222 | 94 ls -l /var/log/mongodb/mongod.log
|
---|
| 223 | 95 ls -ld /var/lib/
|
---|
| 224 | 96 groups mongodb
|
---|
| 225 | 97 less /etc/mongod.conf
|
---|
| 226 | 98 sudo less /var/log/mongodb/mongod.log
|
---|
| 227 | 99 less /etc/mongod.conf
|
---|
| 228 | 100 ls -l /var/lib/mongodb/
|
---|
| 229 | 101 sudo chown -R mongodb /var/lib/mongodb/
|
---|
| 230 | 102 sudo chgrp -R mongodb /var/lib/mongodb/
|
---|
| 231 | 103 sudo service mongod start
|
---|
| 232 | 104 sudo service mongod status
|
---|
| 233 | 105 history
|
---|
| 234 |
|
---|
| 235 |
|
---|
| 236 |
|
---|
| 237 | MONGO DB ROBO 3T
|
---|
| 238 | 1. Download "Double Pack" from https://robomongo.org/
|
---|
| 239 | 2. Untar its contents, then untar the tarball found inside.
|
---|
| 240 | 3. Run:
|
---|
| 241 | wharariki:[110]~/Downloads/robo3t-1.3.1-linux-x86_64-7419c406>./bin/robo3t
|
---|
| 242 |
|
---|
| 243 | ===================
|
---|
| 244 | On analytics, on the Vagrant VM node1, we've installed the MongoDB server and client.
|
---|
| 245 | We're able to create collections there successfully.
|
---|
| 246 |
|
---|
| 247 |
|
---|
| 248 | vagrant@node1:~$ mongo
|
---|
| 249 | MongoDB shell version v4.2.1
|
---|
| 250 | connecting to: mongodb://127.0.0.1:27017/?compressors=disabled&gssapiServiceName=mongodb
|
---|
| 251 | Implicit session: session { "id" : UUID("87bb585c-4685-47f6-bf89-a93801daeb2d") }
|
---|
| 252 | MongoDB server version: 4.2.1
|
---|
| 253 | Server has startup warnings:
|
---|
| 254 | 2019-11-04T07:48:14.197+0000 I STORAGE [initandlisten]
|
---|
| 255 | 2019-11-04T07:48:14.198+0000 I STORAGE [initandlisten] ** WARNING: Using the XFS filesystem is strongly recommended with the WiredTiger storage engine
|
---|
| 256 | 2019-11-04T07:48:14.198+0000 I STORAGE [initandlisten] ** See http://dochub.mongodb.org/core/prodnotes-filesystem
|
---|
| 257 | 2019-11-04T07:48:14.624+0000 I CONTROL [initandlisten]
|
---|
| 258 | 2019-11-04T07:48:14.624+0000 I CONTROL [initandlisten] ** WARNING: Access control is not enabled for the database.
|
---|
| 259 | 2019-11-04T07:48:14.624+0000 I CONTROL [initandlisten] ** Read and write access to data and configuration is unrestricted.
|
---|
| 260 | 2019-11-04T07:48:14.624+0000 I CONTROL [initandlisten]
|
---|
| 261 | ---
|
---|
| 262 | Enable MongoDB's free cloud-based monitoring service, which will then receive and display
|
---|
| 263 | metrics about your deployment (disk utilization, CPU, operation statistics, etc).
|
---|
| 264 |
|
---|
| 265 | The monitoring data will be available on a MongoDB website with a unique URL accessible to you
|
---|
| 266 | and anyone you share the URL with. MongoDB may use this information to make product
|
---|
| 267 | improvements and to suggest MongoDB products and deployment options to you.
|
---|
| 268 |
|
---|
| 269 | To enable free monitoring, run the following command: db.enableFreeMonitoring()
|
---|
| 270 | To permanently disable this reminder, run the following command: db.disableFreeMonitoring()
|
---|
| 271 | ---
|
---|
| 272 |
|
---|
| 273 | > show dbs
|
---|
| 274 | admin 0.000GB
|
---|
| 275 | config 0.000GB
|
---|
| 276 | local 0.000GB
|
---|
| 277 | > use db ateacrawldata
|
---|
| 278 | 2019-11-05T05:24:20.155+0000 E QUERY [js] Error: [db ateacrawldata] is not a valid database name :
|
---|
| 279 | Mongo.prototype.getDB@src/mongo/shell/mongo.js:51:12
|
---|
| 280 | getDatabase@src/mongo/shell/session.js:913:28
|
---|
| 281 | DB.prototype.getSiblingDB@src/mongo/shell/db.js:22:12
|
---|
| 282 | shellHelper.use@src/mongo/shell/utils.js:803:10
|
---|
| 283 | shellHelper@src/mongo/shell/utils.js:790:15
|
---|
| 284 | @(shellhelp2):1:1
|
---|
| 285 | > db.createCollection('webpages');
|
---|
| 286 | { "ok" : 1 }
|
---|
[33646] | 287 | > db.webpages.drop();
|
---|
[33644] | 288 | ... ^C
|
---|
| 289 |
|
---|
| 290 | > db.webpages.drop();
|
---|
| 291 | true
|
---|
| 292 | > use ateacrawldata
|
---|
| 293 | switched to db ateacrawldata
|
---|
| 294 | > db.createCollection('webpages');
|
---|
| 295 | { "ok" : 1 }
|
---|
| 296 | > show collections
|
---|
| 297 | webpages
|
---|
| 298 | > db.createCollection('websites');
|
---|
| 299 | { "ok" : 1 }
|
---|
| 300 | >
|
---|
| 301 |
|
---|
| 302 | ------------------------
|
---|
| 303 |
|
---|
| 304 | Ask Clint to rename "anupama" database to "ateacrawldata" database following the instructions at:
|
---|
| 305 | https://stackoverflow.com/questions/9201832/how-do-you-rename-a-mongodb-database
|
---|
| 306 | I don't have permissions to do this.
|
---|
| 307 | Nor do I have permissions to create Mongo collections within a new database that I create, like ateacrawldata.
|
---|
| 308 | I only seem to have rights to the "anupama" database.
|
---|
| 309 |
|
---|
| 310 |
|
---|
[33646] | 311 |
|
---|
| 312 | -----------------------
|
---|
[33722] | 313 | The Vagrant virtual machine node1 has MongoDB installed.
|
---|
[33646] | 314 |
|
---|
[33722] | 315 | After running "vagrant up" (on analytics) to start node1:
|
---|
| 316 |
|
---|
| 317 | [anupama@analytics vagrant-hadoop-hive-spark]$ vagrant ssh
|
---|
| 318 | vagrant@node1:~$ mongo
|
---|
| 319 | MongoDB shell version v4.2.1
|
---|
| 320 | connecting to: mongodb://127.0.0.1:27017/?compressors=disabled&gssapiServiceName=mongodb
|
---|
| 321 | 2019-11-13T09:22:46.996+0000 E QUERY [js] Error: couldn't connect to server 127.0.0.1:27017, connection attempt failed: SocketException: Error connecting to 127.0.0.1:27017 :: caused by :: Connection refused :
|
---|
| 322 | connect@src/mongo/shell/mongo.js:341:17
|
---|
| 323 | @(connect):2:6
|
---|
| 324 | 2019-11-13T09:22:46.999+0000 F - [main] exception: connect failed
|
---|
| 325 | 2019-11-13T09:22:46.999+0000 E - [main] exiting with code 1
|
---|
| 326 | vagrant@node1:~$ sudo service mongod status
|
---|
| 327 | ● mongod.service - MongoDB Database Server
|
---|
| 328 | Loaded: loaded (/lib/systemd/system/mongod.service; disabled; vendor preset: enabled)
|
---|
| 329 | Active: inactive (dead)
|
---|
| 330 | Docs: https://docs.mongodb.org/manual
|
---|
| 331 | vagrant@node1:~$ sudo service mongod start
|
---|
| 332 | vagrant@node1:~$ sudo service mongod status
|
---|
| 333 | ● mongod.service - MongoDB Database Server
|
---|
| 334 | Loaded: loaded (/lib/systemd/system/mongod.service; disabled; vendor preset: enabled)
|
---|
| 335 | Active: active (running) since Wed 2019-11-13 09:24:07 UTC; 2s ago
|
---|
| 336 | Docs: https://docs.mongodb.org/manual
|
---|
| 337 | Main PID: 4383 (mongod)
|
---|
| 338 | Tasks: 32
|
---|
| 339 | Memory: 199.3M
|
---|
| 340 | CPU: 754ms
|
---|
| 341 | CGroup: /system.slice/mongod.service
|
---|
| 342 | └─4383 /usr/bin/mongod --config /etc/mongod.conf
|
---|
| 343 |
|
---|
| 344 | Nov 13 09:24:07 node1 systemd[1]: Started MongoDB Database Server.
|
---|
| 345 | vagrant@node1:~$
|
---|
| 346 |
|
---|
| 347 |
|
---|
| 348 | So now mongodb is running on node1 on localhost:27017.
|
---|
| 349 |
|
---|
[33905] | 350 | Next, in another x-term on analytics, connect to the node1 Vagrant VM while port-forwarding node1's localhost:27017 to analytics' localhost:27017:
|
---|
[33722] | 351 | vagrant ssh -- -L 27017:localhost:27017
|
---|
| 352 |
|
---|
| 353 |
|
---|
| 354 |
|
---|
[33905] | 355 | Finally, in another x-term (on wharariki), port-forward from analytics' localhost:27017 to the current machine's localhost:27017:
|
---|
[33722] | 356 | ssh -L 27017:localhost:27017 analytics
|
---|
| 357 |
|
---|
| 358 |
|
---|
[33905] | 359 | Run Robo-3T: go to /home/anupama/Downloads/robo3t-1.3.1-linux-x86_64-7419c406/bin
|
---|
| 360 | and double-click robo3t.
|
---|
| 361 |
|
---|
| 362 | In the connection screen, choose localhost:27017.
|
---|
[33722] | 363 | Robo-3T running on the current machine can now connect to localhost:27017.
|
---|
| 364 |
|
---|
| 365 | Then, in a new x-term, the mongo client shell can be used to connect (by default to localhost:27017):
|
---|
| 366 |
|
---|
| 367 | wharariki:[122]/Scratch/ak19/GS309>mongo --shell
|
---|
| 368 | MongoDB shell version v4.0.13
|
---|
| 369 | connecting to: mongodb://127.0.0.1:27017/?gssapiServiceName=mongodb
|
---|
| 370 | ...
|
---|
| 371 | > show dbs
|
---|
| 372 | admin 0.000GB
|
---|
| 373 | ateacrawldata 1.532GB
|
---|
| 374 | config 0.000GB
|
---|
| 375 | local 0.000GB
|
---|
| 376 | > use ateacrawldata
|
---|
| 377 |
|
---|
| 378 | > show collections
|
---|
| 379 | Webpages
|
---|
| 380 | Websites
|
---|
| 381 | oldwebpages
|
---|
| 382 | oldwebsites
|
---|
| 383 | -------------------
|
---|
| 384 |
|
---|
| 385 | Country code to geolocation CSV file found by Dr Bainbridge:
|
---|
| 386 | https://developers.google.com/public-data/docs/canonical/countries_csv
|
---|
| 387 |
|
---|
| 388 | Import into mongodb with:
|
---|
| 389 | https://stackoverflow.com/questions/4686500/how-to-use-mongoimport-to-import-csv
|
---|
| 390 |
|
---|
| 391 |
|
---|
| 392 |
|
---|
| 393 | NOTE: mongoimport is a command-line utility, not a command to be run from within the mongo shell. See https://jira.mongodb.org/browse/DOCS-11072
|
---|
| 394 | This means: in an x-term, DON'T start the mongo shell/client first. Instead, run the following directly from the x-term to import the countrycodes.csv file:
|
---|
| 395 |
|
---|
| 396 |
|
---|
| 397 | mongoimport -d ateacrawldata -c countrylocations --type csv --file /Scratch/ak19/maori-lang-detection/MoreReading/countrycodes.csv --headerline
|
---|
| 398 |
|
---|
| 399 |
|
---|
| 400 | -------------------------
|
---|
| 401 |
|
---|
[33646] | 402 | MONGODB QUERIES:
|
---|
| 403 |
|
---|
| 404 | db.getCollection('webpages').find({"isMRI": true, "singleSentences.langCode": "mri"})
|
---|
| 405 | db.getCollection('webpages').find({"singleSentences": { $elemMatch: {"langCode":"mri"} } }, {"singleSentences.$": "mri"})
|
---|
[33653] | 406 | db.getCollection('Webpages').find({"isMRI": true, "singleSentences": { $elemMatch: {"langCode":"eng"} } }, {"singleSentences.$": "eng"}) [single English lang sentence]
|
---|
| 407 | db.getCollection('Webpages').find({"containsMRI": true, "singleSentences": { $elemMatch: {"langCode":"mri"} } }, {"singleSentences.$": "mri"}) [gets 1st sentence of docs which have sentences containing MRI]
|
---|
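What the last two queries return can be sketched in plain JavaScript, with no MongoDB needed. This is only an illustration under assumptions: the sample document, its "sentence" field and the helper name firstMatchingSentence are made up; only the field names containsMRI, singleSentences and langCode come from the queries above. The helper mimics $elemMatch combined with the positional projection singleSentences.$: a document qualifies if any array element matches, and only the first matching element is returned.

```javascript
// Hypothetical sample document; only containsMRI, singleSentences and
// langCode are field names taken from the queries above.
const page = {
  _id: 1,
  containsMRI: true,
  singleSentences: [
    { langCode: "eng", sentence: "Hello there." },
    { langCode: "mri", sentence: "Kia ora." },
    { langCode: "mri", sentence: "Haere mai." },
  ],
};

// Emulates, for a single document,
//   find({singleSentences: {$elemMatch: {langCode}}}, {"singleSentences.$": 1})
// Returns null when no array element matches; otherwise returns the _id
// plus an array holding only the FIRST matching element.
function firstMatchingSentence(doc, langCode) {
  const hit = doc.singleSentences.find((s) => s.langCode === langCode);
  if (hit === undefined) return null;
  return { _id: doc._id, singleSentences: [hit] };
}

const result = firstMatchingSentence(page, "mri");
console.log(JSON.stringify(result));
```

Note that the second "mri" sentence is not returned: the positional projection yields one element per document, which is why the note above says "1st sentence".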
[33646] | 408 |
|
---|
| 409 |
|
---|
| 410 | READING
|
---|
| 411 |
|
---|
| 412 | mongodb java convert class
|
---|
| 413 | https://www.quora.com/What-are-the-ways-of-converting-a-Java-object-to-a-MongoDB-document-and-vice-versa
|
---|
| 414 | https://stackoverflow.com/questions/39320825/pojo-to-org-bson-document-and-vice-versa
|
---|
| 415 | X https://mongodb.github.io/morphia/
|
---|
| 416 | https://stackoverflow.com/questions/10170506/inserting-java-object-to-mongodb-collection-using-java
|
---|
| 417 | X https://www.google.com/search?q=morphia+example&oq=morphia+example&aqs=chrome.0.0l6.4223j0j9&sourceid=chrome&ie=UTF-8
|
---|
| 418 | https://www.baeldung.com/mongodb-morphia
|
---|
| 419 | X https://web.archive.org/web/20171117121335/http://mongodb.github.io/morphia/1.3/getting-started/
|
---|
| 420 | => https://morphia.dev/1.4/getting-started/quick-tour/
|
---|
| 421 | https://github.com/MorphiaOrg/morphia/tree/master/docs/reference
|
---|
| 422 |
|
---|
| 423 |
|
---|
| 424 | mongodb querying
|
---|
| 425 | https://docs.mongodb.com/manual/tutorial/query-embedded-documents/
|
---|
| 426 | https://docs.mongodb.com/manual/tutorial/query-arrays/
|
---|
| 427 | https://www.google.com/search?q=mongodb+find+subdocument&oq=mongodb+find+&aqs=chrome.0.69i59j69i57j0l4.7607j1j8&sourceid=chrome&ie=UTF-8
|
---|
| 428 | https://stackoverflow.com/questions/25586901/how-to-find-document-and-single-subdocument-matching-given-criterias-in-mongodb
|
---|
| 429 | https://stackoverflow.com/questions/21113543/mongodb-get-subdocument
|
---|
| 430 | https://stackoverflow.com/questions/36948856/find-subdocuments-in-mongo
|
---|
| 431 | https://docs.mongodb.com/v3.0/reference/operator/projection/positional/#proj._S_
|
---|
| 432 | https://www.google.com/search?q=mongodb+query+tutorial&oq=mongodb+query+tutorial&aqs=chrome..69i57j0l2j69i60l3.4719j0j7&sourceid=chrome&ie=UTF-8
|
---|
| 433 | https://blog.exploratory.io/an-introduction-to-mongodb-query-for-beginners-bd463319aa4c
|
---|
| 434 | https://docs.mongodb.com/manual/reference/method/db.collection.find/
|
---|
| 435 | https://docs.mongodb.com/manual/reference/method/db.collection.find/#find-projection
|
---|
[33698] | 436 | https://stackoverflow.com/questions/39641925/mongodb-aggregation-framework-to-get-frequencies-of-fields-values
|
---|
[33666] | 437 |
|
---|
[33698] | 438 | https://exploratory.io/note/kanaugust/0961813761939766
|
---|
| 439 | https://docs.mongodb.com/manual/tutorial/project-fields-from-query-results/
|
---|
| 440 | https://docs.mongodb.com/manual/aggregation/
|
---|
| 441 |
|
---|
| 442 |
|
---|
[33675] | 443 | Mongo Studio 3T documentation:
|
---|
| 444 | https://studio3t.com/download/ (also has uninstall information)
|
---|
| 445 | https://studio3t.com/download-thank-you/?OS=x64
|
---|
[33666] | 446 |
|
---|
[33675] | 447 | Google: MongoDB visualization
|
---|
| 448 | MongoDB visualization map
|
---|
| 449 | MongoDB Charts
|
---|
| 450 | (Open source visualisation tools)
|
---|
| 451 |
|
---|
| 452 | json map visualizer
|
---|
| 453 | geojson.tools
|
---|
[33666] | 454 | -------------------
|
---|
| 455 |
|
---|
| 456 | Some queries with results:
|
---|
| 457 |
|
---|
| 458 | # Num websites
|
---|
| 459 | db.getCollection('Websites').find({}).count()
|
---|
[33804] | 460 | 1445
|
---|
[33666] | 461 |
|
---|
| 462 | # Num webpages
|
---|
| 463 | db.getCollection('Webpages').find({}).count()
|
---|
[33675] | 464 | X75139
|
---|
| 465 | 117496
|
---|
[33666] | 466 |
|
---|
[33813] | 467 | # Find number of websites that have 1 or more pages detected as being in Maori (a positive numPagesInMRI)
|
---|
[33666] | 468 | db.getCollection('Websites').find({numPagesInMRI: { $gt: 0}}).count()
|
---|
| 469 | 361
|
---|
| 470 |
|
---|
[33804] | 471 | # Number of sites containing at least one sentence for which OpenNLP detected the best language = MRI
|
---|
| 472 | db.getCollection('Websites').find({numPagesContainingMRI: {$gt: 0}}).count()
|
---|
| 473 | 868
|
---|
| 474 |
|
---|
| 475 | # As expected, the union of the above two gives the same count as numPagesContainingMRI > 0:
|
---|
| 476 | db.getCollection('Websites').find({ $or: [ { numPagesInMRI: { $gt: 0 } }, { numPagesContainingMRI: {$gt: 0} } ] } ).count()
|
---|
| 477 | 868
|
---|
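Why the $or count matches the numPagesContainingMRI count can be sketched in plain JavaScript. The four sample sites below are invented for illustration. The assumption, suggested by the counts above but not confirmed anywhere in these notes, is that every site with numPagesInMRI > 0 also has numPagesContainingMRI > 0, so the first set lies inside the second and the union adds nothing.

```javascript
// Invented sample data; only the two field names come from the queries above.
const sites = [
  { _id: "a", numPagesInMRI: 3, numPagesContainingMRI: 5 },
  { _id: "b", numPagesInMRI: 0, numPagesContainingMRI: 2 },
  { _id: "c", numPagesInMRI: 1, numPagesContainingMRI: 1 },
  { _id: "d", numPagesInMRI: 0, numPagesContainingMRI: 0 },
];

// find({numPagesInMRI: {$gt: 0}})
const inMRI = sites.filter((s) => s.numPagesInMRI > 0);
// find({numPagesContainingMRI: {$gt: 0}})
const containingMRI = sites.filter((s) => s.numPagesContainingMRI > 0);
// find({$or: [{numPagesInMRI: {$gt: 0}}, {numPagesContainingMRI: {$gt: 0}}]})
const either = sites.filter(
  (s) => s.numPagesInMRI > 0 || s.numPagesContainingMRI > 0
);

console.log(inMRI.length, containingMRI.length, either.length); // prints: 2 3 3
```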
| 478 |
|
---|
[33666] | 479 | # Find number of webpages that are deemed to be overall in MRI (pages where isMRI=true)
|
---|
| 480 | db.getCollection('Webpages').find({isMRI:true}).count()
|
---|
| 481 | X5224
|
---|
[33675] | 482 | X5215
|
---|
| 483 | db.getCollection('Webpages').find({isMRI:true}).count()
|
---|
| 484 | 7818
|
---|
[33666] | 485 |
|
---|
| 486 | # Number of pages that contain any number of MRI sentences
|
---|
| 487 | db.getCollection('Webpages').find({containsMRI: true}).count()
|
---|
[33675] | 488 | X12858
|
---|
| 489 | 20371
|
---|
[33666] | 490 |
|
---|
[33675] | 491 |
|
---|
[33666] | 492 | # Number of sites with URLs containing /mi(/)
|
---|
[33800] | 493 | db.getCollection('Websites').find({urlContainsLangCodeInPath:true}).count()
|
---|
[33813] | 494 | X 153
|
---|
| 495 | # Number of sites with URLs containing /mi(/) OR http(s)://mi.*
|
---|
| 496 | db.getCollection('Websites').find({urlContainsLangCodeInPath:true}).count()
|
---|
| 497 | 670
|
---|
[33666] | 498 |
|
---|
| 499 | # Number of websites outside NZ that contain /mi(/) in any of their sub-URLs
|
---|
[33800] | 500 | db.getCollection('Websites').find({urlContainsLangCodeInPath:true, geoLocationCountryCode: {$ne : "NZ"} }).count()
|
---|
[33813] | 501 | X 147
|
---|
| 502 | # Number of websites outside NZ that contain /mi(/) OR http(s)://mi.* in any of their sub-URLs
|
---|
| 503 | db.getCollection('Websites').find({urlContainsLangCodeInPath:true, geoLocationCountryCode: {$ne : "NZ"} }).count()
|
---|
| 504 | 656
|
---|
[33666] | 505 |
|
---|
[33813] | 506 | # 6 sites with URLs containing /mi(/) that are in NZ
|
---|
[33800] | 507 | db.getCollection('Websites').find({urlContainsLangCodeInPath:true, geoLocationCountryCode: "NZ"}).count()
|
---|
[33813] | 508 | X 6
|
---|
| 509 | # 14 sites with URLs containing /mi(/) OR http(s)://mi.* that are in NZ
|
---|
| 510 | 14
|
---|
[33666] | 511 |
|
---|
[33804] | 512 |
|
---|
[33666] | 513 | # sort websites that contain /mi(/) in path by geoLocationCountryCode
|
---|
| 514 | # https://www.quackit.com/mongodb/tutorial/mongodb_sort_query_results.cfm
|
---|
[33800] | 515 | db.getCollection('Websites').find({urlContainsLangCodeInPath:true}).sort({geoLocationCountryCode: 1})
|
---|
[33666] | 516 |
|
---|
[33675] | 517 | Actually, I want to sort by count. See https://docs.mongodb.com/manual/reference/operator/aggregation/sortByCount/
|
---|
[33666] | 518 |
|
---|
[33675] | 519 |
|
---|
[33698] | 520 | # PROJECTION:
|
---|
[33800] | 521 | db.getCollection('Websites').find({geoLocationCountryCode: {$ne:"nz"}}, {geoLocationCountryCode:1, urlContainsLangCodeInPath: 1})
|
---|
[33675] | 522 |
|
---|
[33698] | 523 | https://docs.mongodb.com/manual/aggregation/
|
---|
[33710] | 524 | EXAMPLE:
|
---|
[33698] | 525 | db.orders.aggregate([
|
---|
| 526 | { $match: { status: "A" } },
|
---|
| 527 | { $group: { _id: "$cust_id", total: { $sum: "$amount" } } }
|
---|
| 528 | ])
|
---|
| 529 |
|
---|
[33710] | 530 | X db.Websites.aggregate([{ $match:{urlContainsLangCodeInPath:true}}, $group: {geoLocationCountryCode:1, total: $count}])
|
---|
[33698] | 531 |
|
---|
[33710] | 532 |
|
---|
| 533 | X db.Websites.aggregate([
|
---|
| 534 | { $match:{urlContainsLangCodeInPath:true}},
|
---|
| 535 | {$group: {geoLocationCountryCode:1}}
|
---|
| 536 | ])
|
---|
| 537 |
|
---|
| 538 | WORKS (but adding an $unwind stage would get rid of the "null" group):
|
---|
| 539 | db.Websites.aggregate([
|
---|
| 540 | { $match:{urlContainsLangCodeInPath:true}},
|
---|
| 541 | {$group: {_id: "$geoLocationCountryCode", count: {$sum: 1}}},
|
---|
| 542 | { $sort : { count : -1} }
|
---|
| 543 | ])
|
---|
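For reference, here is a rough plain-JavaScript sketch of what the pipeline above computes: filter, count per country code, then sort by descending count. The sample documents and the helper name countByCountry are invented. A document with no geoLocationCountryCode is grouped under null, which is where the "null" entry comes from.

```javascript
// Invented sample data; the field names are taken from the pipeline above.
const websites = [
  { urlContainsLangCodeInPath: true, geoLocationCountryCode: "US" },
  { urlContainsLangCodeInPath: true, geoLocationCountryCode: "NZ" },
  { urlContainsLangCodeInPath: true, geoLocationCountryCode: "US" },
  { urlContainsLangCodeInPath: false, geoLocationCountryCode: "NZ" },
  { urlContainsLangCodeInPath: true }, // no country code recorded
];

function countByCountry(docs) {
  const counts = new Map();
  for (const doc of docs) {
    // $match: {urlContainsLangCodeInPath: true}
    if (doc.urlContainsLangCodeInPath !== true) continue;
    // $group with _id: "$geoLocationCountryCode"; a missing field groups as null
    const key =
      doc.geoLocationCountryCode === undefined ? null : doc.geoLocationCountryCode;
    // count: {$sum: 1}
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  // $sort: {count: -1}
  return [...counts.entries()]
    .map(([_id, count]) => ({ _id, count }))
    .sort((a, b) => b.count - a.count);
}

const grouped = countByCountry(websites);
console.log(grouped);
```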
| 544 |
|
---|
| 545 |
|
---|
| 546 | # COUNT OF ALL GEOLOCATION COUNTRIES
|
---|
| 547 | #https://stackoverflow.com/questions/14924495/mongodb-count-num-of-distinct-values-per-field-key
|
---|
| 548 | # LIST
|
---|
| 549 | db.Websites.distinct('geoLocationCountryCode');
|
---|
| 550 |
|
---|
| 551 | # COUNT
|
---|
| 552 | db.Websites.distinct('geoLocationCountryCode').length;
|
---|
| 553 |
|
---|
| 554 | # A COUNT WITH QUERY - https://docs.mongodb.com/manual/reference/command/distinct/#dbcmd.distinct
|
---|
| 555 |
|
---|
| 556 | db.runCommand ( { distinct: "Websites", key: "geoLocationCountryCode", query: { "urlContainsLangCodeInPath": true} } );
|
---|
| 557 |
|
---|
| 558 | # DISTINCT WITH QUERY WITHOUT COUNT - https://docs.mongodb.com/manual/reference/method/db.collection.distinct/
|
---|
| 559 | db.Websites.distinct('geoLocationCountryCode', {"urlContainsLangCodeInPath": true});
|
---|
| 560 |
|
---|
| 561 | #SORTED - https://stackoverflow.com/questions/4759437/get-distinct-values-with-sorted-data
|
---|
| 562 | db.Websites.distinct('geoLocationCountryCode', {"urlContainsLangCodeInPath": true}).sort();
|
---|
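The distinct variants above differ only in what is done with the returned array of values. A rough emulation in plain JavaScript, assuming a simple equality query; the sample documents and the helper name distinctValues are hypothetical. Note that the distinct command returns the values themselves, so a count is the length of that array.

```javascript
// Hypothetical sample documents standing in for the Websites collection.
const distinctSample = [
  { geoLocationCountryCode: "US", urlContainsLangCodeInPath: true },
  { geoLocationCountryCode: "NZ", urlContainsLangCodeInPath: true },
  { geoLocationCountryCode: "US", urlContainsLangCodeInPath: true },
  { geoLocationCountryCode: "DE", urlContainsLangCodeInPath: false }
];

// Emulates db.Websites.distinct(field, query) for a simple equality query.
function distinctValues(docs, field, query = {}) {
  const matches = docs.filter(doc =>
    Object.entries(query).every(([key, value]) => doc[key] === value));
  return [...new Set(matches.map(doc => doc[field]))];
}

const codes = distinctValues(distinctSample, "geoLocationCountryCode",
                             { urlContainsLangCodeInPath: true });
codes.sort();               // SORTED list of distinct values
console.log(codes);
console.log(codes.length);  // COUNT of distinct values
```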
| 563 |
|
---|
| 564 |
|
---|
[33787] | 565 | # count of all sites for which the geolocation is UNKNOWN
|
---|
| 566 | db.getCollection('Websites').find({geoLocationCountryCode: {$eq:"UNKNOWN"}}).count()
|
---|
| 567 |
|
---|
| 568 |
|
---|
[33710] | 569 | # AGGREGATION QUERIES THAT WORK:
|
---|
| 570 | #https://stackoverflow.com/questions/14924495/mongodb-count-num-of-distinct-values-per-field-key
|
---|
| 571 |
|
---|
[33787] | 572 | WORKS:
|
---|
| 573 | // count of country codes for all sites
|
---|
[33710] | 574 | db.Websites.aggregate([
|
---|
[33787] | 575 |
|
---|
| 576 | { $unwind: "$geoLocationCountryCode" },
|
---|
[33710] | 577 | {
|
---|
[33787] | 578 | $group: {
|
---|
| 579 | _id: "$geoLocationCountryCode",
|
---|
| 580 | count: { $sum: 1 }
|
---|
| 581 | }
|
---|
| 582 | },
|
---|
| 583 | { $sort : { count : -1} }
|
---|
| 584 | ]);
|
---|
| 585 |
|
---|
[33804] | 586 | // count of country codes for sites that have at least one page detected as MRI
|
---|
[33787] | 587 |
|
---|
[33804] | 588 | db.Websites.aggregate([
|
---|
| 589 | {
|
---|
| 590 | $match: {
|
---|
| 591 | numPagesInMRI: {$gt: 0}
|
---|
| 592 | }
|
---|
| 593 | },
|
---|
| 594 | { $unwind: "$geoLocationCountryCode" },
|
---|
| 595 | {
|
---|
| 596 | $group: {
|
---|
| 597 | _id: {$toLower: '$geoLocationCountryCode'},
|
---|
| 598 | count: { $sum: 1 }
|
---|
| 599 | }
|
---|
| 600 | },
|
---|
| 601 | { $sort : { count : -1} }
|
---|
| 602 | ]);
|
---|
| 603 |
|
---|
| 604 | // count of country codes for sites that have at least one page containing at least one sentence detected as MRI
|
---|
| 605 | db.Websites.aggregate([
|
---|
| 606 | {
|
---|
| 607 | $match: {
|
---|
| 608 | numPagesContainingMRI: {$gt: 0}
|
---|
| 609 | }
|
---|
| 610 | },
|
---|
| 611 | { $unwind: "$geoLocationCountryCode" },
|
---|
| 612 | {
|
---|
| 613 | $group: {
|
---|
| 614 | _id: {$toLower: '$geoLocationCountryCode'},
|
---|
| 615 | count: { $sum: 1 }
|
---|
| 616 | }
|
---|
| 617 | },
|
---|
| 618 | { $sort : { count : -1} }
|
---|
| 619 | ]);
|
---|
| 620 |
|
---|
| 621 |
|
---|
[33787] | 622 | WORKS:
|
---|
[33813] | 623 | // count of country codes for sites that have /mi(/) or http(s)://mi.* in URL path
|
---|
[33787] | 624 |
|
---|
| 625 | db.Websites.aggregate([
|
---|
| 626 | {
|
---|
[33710] | 627 | $match: {
|
---|
| 628 | urlContainsLangCodeInPath: true
|
---|
| 629 | }
|
---|
| 630 | },
|
---|
| 631 | { $unwind: "$geoLocationCountryCode" },
|
---|
| 632 | {
|
---|
| 633 | $group: {
|
---|
| 634 | _id: {$toLower: '$geoLocationCountryCode'},
|
---|
| 635 | count: { $sum: 1 }
|
---|
| 636 | }
|
---|
| 637 | },
|
---|
[33722] | 638 | { $sort : { count : -1} }
|
---|
[33710] | 639 | ]);
|
---|
| 640 |
|
---|
| 641 |
|
---|
| 642 | WORKS:
|
---|
| 643 | db.Websites.aggregate([
|
---|
| 644 | {
|
---|
| 645 | $match: {
|
---|
| 646 | geoLocationCountryCode: {$ne : "UNKNOWN"}
|
---|
| 647 | }
|
---|
| 648 | },
|
---|
| 649 | { $unwind: "$geoLocationCountryCode" },
|
---|
| 650 | {
|
---|
| 651 | $group: {
|
---|
| 652 | _id: "$geoLocationCountryCode",
|
---|
| 653 | count: { $sum: 1 }
|
---|
| 654 | }
|
---|
| 655 | },
|
---|
[33722] | 656 | { $sort : { count : -1} }
|
---|
[33710] | 657 | ]);
|
---|
| 658 |
|
---|
| 659 | WORKS:
|
---|
| 660 | db.Websites.aggregate([
|
---|
| 661 | {
|
---|
| 662 | $match: {
|
---|
| 663 | "urlContainsLangCodeInPath": true
|
---|
| 664 | }
|
---|
| 665 | },
|
---|
| 666 | { $unwind: "$geoLocationCountryCode" },
|
---|
| 667 | {
|
---|
| 668 | $group: {
|
---|
| 669 | _id: "$geoLocationCountryCode",
|
---|
| 670 | count: { $sum: 1 }
|
---|
| 671 | }
|
---|
| 672 | },
|
---|
[33722] | 673 | { $sort : { count : -1} }
|
---|
[33710] | 674 | ]);
|
---|
| 675 |
|
---|
| 676 |
|
---|
| 677 | KEEP ADDITIONAL FIELDS - https://stackoverflow.com/questions/16662405/mongo-group-query-how-to-keep-fields:
|
---|
| 678 | a. KEEPS ONLY FIRST DOMAIN URL FOR EACH COUNTED COUNTRY CODE:
|
---|
| 679 |
|
---|
| 680 | db.Websites.aggregate([
|
---|
| 681 | {
|
---|
| 682 | $match: {
|
---|
| 683 | "urlContainsLangCodeInPath": true
|
---|
| 684 | }
|
---|
| 685 | },
|
---|
| 686 | { $unwind: "$geoLocationCountryCode" },
|
---|
| 687 | {
|
---|
| 688 | $group: {
|
---|
| 689 | _id: "$geoLocationCountryCode", count: { $sum: 1 },
|
---|
| 690 | domain: {$first: '$domain'}
|
---|
| 691 | }
|
---|
| 692 | },
|
---|
| 693 | { $sort : { count : -1} }
|
---|
| 694 | ]);
|
---|
| 695 |
|
---|
| 696 | b. KEEP ALL DOMAIN URLS:
|
---|
| 697 | db.Websites.aggregate([
|
---|
| 698 | {
|
---|
| 699 | $match: {
|
---|
| 700 | "urlContainsLangCodeInPath": true
|
---|
| 701 | }
|
---|
| 702 | },
|
---|
| 703 | { $unwind: "$geoLocationCountryCode" },
|
---|
| 704 | {
|
---|
| 705 | $group: {
|
---|
| 706 | _id: "$geoLocationCountryCode",
|
---|
| 707 | count: { $sum: 1 },
|
---|
| 708 | domain: { $addToSet: '$domain' }
|
---|
| 709 | }
|
---|
| 710 | },
|
---|
| 711 | { $sort : { count : -1} }
|
---|
| 712 | ]);
|
---|
| 713 |
|
---|
| 714 |
|
---|
| 715 | # WANT TO GET THE ABOVE INTO WORLD MAP, use geojson.tools found by Dr Bainbridge
|
---|
| 716 | geojson.tools
|
---|
| 717 | USAGE: https://www.here.xyz/viewer-tool/
|
---|
| 718 |
|
---|
| 719 |
|
---|
[33698] | 720 | AIMS:
|
---|
[33675] | 721 | * Identify where Maori language is online.
|
---|
| 722 | * How can we identify high quality sites that would be good for a corpus?
|
---|
| 723 | (Related work for other languages to quantifiably answer that)
|
---|
| 724 |
|
---|
[33806] | 725 | data-preparation
|
---|
| 726 | docs
|
---|
[33698] | 727 |
|
---|
| 728 |
|
---|
[33806] | 729 | ------------------------------------------
|
---|
[33698] | 730 |
|
---|
[33806] | 731 | BUILDING TOWARDS NEW MONGODB QUERY: Counts by country code of TENTATIVE NON-PRODUCT SITES that are in Maori
|
---|
| 732 | ---
|
---|
[33698] | 733 |
|
---|
[33806] | 734 | # https://stackoverflow.com/questions/16902930/mongodb-aggregation-framework-match-or
|
---|
| 735 | # https://docs.mongodb.com/manual/reference/operator/query/and/
|
---|
[33710] | 736 |
|
---|
[33806] | 737 | # 1. all the websites which are from NZ:
|
---|
| 738 | db.getCollection('Websites').find({geoLocationCountryCode: "NZ"}).count()
|
---|
| 739 | 128
|
---|
[33710] | 740 |
|
---|
[33806] | 741 | # 2. all the websites that have /mi in URL path which are from NZ:
|
---|
| 742 | db.getCollection('Websites').find({$and: [{urlContainsLangCodeInPath: true}, {geoLocationCountryCode: "NZ"}]}).count()
|
---|
| 743 | 6
|
---|
[33710] | 744 |
|
---|
[33806] | 745 | # 3. all the websites that don't have /mi in URLpath
|
---|
| 746 | db.getCollection('Websites').find({urlContainsLangCodeInPath: false}).count()
|
---|
| 747 | 1292
|
---|
| 748 |
|
---|
| 749 | # 4. all the websites that don't have /mi, or if they do are from NZ
|
---|
| 750 | # (should be the sum of points 2 and 3 above)
|
---|
| 751 | db.getCollection('Websites').find({$or: [{urlContainsLangCodeInPath: false}, {$and: [{urlContainsLangCodeInPath: true}, {geoLocationCountryCode: "NZ"}]}]}).count()
|
---|
| 752 | 1298
|
---|
| 753 |
|
---|
| 754 | # 5. All the websites that have at least 1 page detected as MRI AND either don't have /mi in URL path or if they do are from NZ
|
---|
| 755 | # These are the TENTATIVE NON-PRODUCT SITES
|
---|
| 756 | # Should be less than the point 4, but more than 1 to 3
|
---|
[33813] | 757 |
|
---|
[33806] | 758 | db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{$or: [{urlContainsLangCodeInPath: false}, {$and: [{urlContainsLangCodeInPath: true}, {geoLocationCountryCode: "NZ"}]}]}]}).count()
|
---|
[33813] | 759 | X 859
|
---|
[33806] | 760 |
|
---|
[33813] | 761 | Now with http(s)://mi.* also excluded, the above query returns a count of:
|
---|
| 762 | 389
|
---|
| 763 |
|
---|
| 764 |
|
---|
| 765 | BUT THIS IS THE CORRECT VERSION OF THE QUERY:
|
---|
| 766 | db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{$or: [{geoLocationCountryCode: "NZ"}, {urlContainsLangCodeInPath: false}]}]}).count()
|
---|
| 767 | 389
|
---|
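Why the shorter $or gives the same count can be checked with a small truth table. This is a sketch in plain JavaScript, assuming MongoDB's equality semantics where a missing field matches neither true nor false; the function names are made up for illustration.

```javascript
// Original filter from point 5: no /mi in path, OR (/mi in path AND from NZ).
function originalFilter(miInPath, fromNZ) {
  return miInPath === false || (miInPath === true && fromNZ);
}

// Simplified filter: from NZ, OR no /mi in path.
function simplifiedFilter(miInPath, fromNZ) {
  return fromNZ || miInPath === false;
}

// Compare the two on every combination, including a missing flag (undefined).
const disagreements = [];
for (const miInPath of [true, false, undefined]) {
  for (const fromNZ of [true, false]) {
    if (originalFilter(miInPath, fromNZ) !== simplifiedFilter(miInPath, fromNZ)) {
      disagreements.push({ miInPath, fromNZ });
    }
  }
}
console.log(disagreements);
```

The two forms only disagree for an NZ site whose urlContainsLangCodeInPath field is not set at all, so identical counts suggest no such document matched here.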
| 768 |
|
---|
| 769 |
|
---|
[33806] | 770 | # 6. Now do the counts by country code of the above, by pasting the query of point 5 as the $match clause (i.e. without the .count() suffix)
|
---|
| 771 | # Counts by country code of TENTATIVE NON-PRODUCT SITES that are in Maori
|
---|
| 772 | db.Websites.aggregate([
|
---|
| 773 | {
|
---|
| 774 | $match: {$and: [{numPagesContainingMRI: {$gt: 0}},{$or: [{urlContainsLangCodeInPath: false}, {$and: [{urlContainsLangCodeInPath: true}, {geoLocationCountryCode: "NZ"}]}]}]}
|
---|
| 775 | },
|
---|
| 776 | { $unwind: "$geoLocationCountryCode" },
|
---|
| 777 | {
|
---|
| 778 | $group: {
|
---|
| 779 | _id: {$toLower: '$geoLocationCountryCode'},
|
---|
| 780 | count: { $sum: 1 }
|
---|
| 781 | }
|
---|
| 782 | },
|
---|
| 783 | { $sort : { count : -1} }
|
---|
| 784 | ]);
|
---|
| 785 |
|
---|
| 786 | The result is very close to the same aggregate on just numPagesContainingMRI.
|
---|
| 787 |
|
---|
| 788 | That's because very few websites both have /mi/ in the URL path and have numPagesContainingMRI > 0:
|
---|
| 789 |
|
---|
| 790 | db.Websites.aggregate([
|
---|
| 791 | {
|
---|
| 792 | $match: {
|
---|
| 793 | $and: [{numPagesContainingMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}]
|
---|
| 794 | }
|
---|
| 795 | },
|
---|
| 796 | { $unwind: "$geoLocationCountryCode" },
|
---|
| 797 | {
|
---|
| 798 | $group: {
|
---|
| 799 | _id: {$toLower: '$geoLocationCountryCode'},
|
---|
| 800 | count: { $sum: 1 }
|
---|
| 801 | }
|
---|
| 802 | },
|
---|
| 803 | { $sort : { count : -1} }
|
---|
| 804 | ]);
|
---|
| 805 |
|
---|
| 806 |
|
---|
| 807 | _id count
|
---|
| 808 | us 4.0
|
---|
| 809 | nz 4.0
|
---|
| 810 | au 3.0
|
---|
| 811 | ru 1.0
|
---|
| 812 | de 1.0
|
---|
| 813 |
|
---|
| 814 | Total: 13 sites that have /mi/ and are detected as having MRI content,
|
---|
| 815 | db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}]}).count()
|
---|
| 816 | 13
|
---|
| 817 |
|
---|
| 818 | Of these 13, the 4 from NZ were already included in steps 5 and 6. So the difference is only 9 sites (13 - 4) that have /mi in the URL path.
|
---|
| 819 |
|
---|
| 820 |
|
---|
| 821 | Let's get a listing of the sites' domains - 3 whose country codes are NOT NZ have NZ TLD!
|
---|
[33710] | 822 | /* 1 */
|
---|
| 823 | {
|
---|
[33806] | 824 | "_id" : "nz",
|
---|
| 825 | "count" : 4.0,
|
---|
| 826 | "domain" : [
|
---|
| 827 | "http://firstworldwar.tki.org.nz",
|
---|
| 828 | "http://www.firstworldwar.tki.org.nz",
|
---|
| 829 | "https://admin.teara.govt.nz",
|
---|
| 830 | "http://community.nzdl.org"
|
---|
| 831 | ]
|
---|
| 832 | }
|
---|
| 833 |
|
---|
| 834 | /* 2 */
|
---|
| 835 | {
|
---|
| 836 | "_id" : "us",
|
---|
| 837 | "count" : 4.0,
|
---|
| 838 | "domain" : [
|
---|
| 839 | "https://sexualviolence.victimsinfo.govt.nz",
|
---|
| 840 | "https://follow3rs.com",
|
---|
| 841 | "http://www.church-of-christ.org",
|
---|
| 842 | "http://www.mytrickstips.com"
|
---|
| 843 | ]
|
---|
| 844 | }
|
---|
| 845 |
|
---|
| 846 | /* 3 */
|
---|
| 847 | {
|
---|
| 848 | "_id" : "au",
|
---|
| 849 | "count" : 3.0,
|
---|
| 850 | "domain" : [
|
---|
| 851 | "https://rapuatearatika.education.govt.nz",
|
---|
| 852 | "https://www.kiwiproperty.com",
|
---|
| 853 | "https://curriculumtool.education.govt.nz"
|
---|
| 854 | ]
|
---|
| 855 | }
|
---|
| 856 |
|
---|
| 857 | /* 4 */
|
---|
| 858 | {
|
---|
| 859 | "_id" : "ru",
|
---|
| 860 | "count" : 1.0,
|
---|
| 861 | "domain" : [
|
---|
| 862 | "http://www.treningmozga.com"
|
---|
| 863 | ]
|
---|
| 864 | }
|
---|
| 865 |
|
---|
| 866 | /* 5 */
|
---|
| 867 | {
|
---|
| 868 | "_id" : "de",
|
---|
| 869 | "count" : 1.0,
|
---|
| 870 | "domain" : [
|
---|
| 871 | "http://www.almancax.com" # Website to learn German, autotranslated
|
---|
| 872 | ]
|
---|
| 873 | }
|
---|
| 874 |
|
---|
| 875 |
|
---|
| 876 | But we're not catching a potentially large number of auto-translated sites, like
|
---|
| 877 | - https://www.gigalight.com/all-languages.html
|
---|
| 878 | - http://www.hzhinew.com/
|
---|
| 879 |
|
---|
[33807] | 880 | https://culturesconnection.com/manual-or-automatic-translation/
|
---|
| 881 | Manual Or Automatic Translation?
|
---|
[33806] | 882 |
|
---|
[33807] | 883 | Automatic translation continues to improve day by day. However, it is still unable to reach perfect levels of accuracy and lacks a natural feel. Will it ever replace human translation?
|
---|
| 884 |
|
---|
[33806] | 885 | --------------
|
---|
[33807] | 886 | Mr Bill Rogers' suggestions for beginnings of trying to sieve out the auto-translated sites:
|
---|
| 887 | - skip .com. .co.<tld>. But .co.nz is also used for non-commercial sites or sites that nevertheless have high quality Maori language content.
|
---|
| 888 | - change cut-off value of OpenNLP language prediction? But for sentences and overlapping sentences, we're not using the cut-off value, we're just checking the best predicted language regardless of confidence level for this.
|
---|
[33806] | 889 |
|
---|
[33807] | 890 | - Code for (a range of) loading of language options in auto-translated sites?
|
---|
[33806] | 891 |
|
---|
[33807] | 892 | ====================
|
---|
[33806] | 893 |
|
---|
[33807] | 894 | # https://stackoverflow.com/questions/20175122/how-can-i-use-not-like-operator-in-mongodb
|
---|
[33806] | 895 |
|
---|
[33807] | 896 | Info on the sites with Maori language content that are either from NZ or have .nz domain (TLD):
|
---|
[33806] | 897 |
|
---|
[33807] | 898 | db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}}, {$or:[{geoLocationCountryCode: "NZ"}, {domain: /.nz$/}]}]})
|
---|
| 899 |
|
---|
| 900 | db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}}, {$or:[{geoLocationCountryCode: "NZ"}, {domain: /.nz$/}]}]}).count()
|
---|
| 901 | 183
|
---|
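A caveat on the /.nz$/ pattern used in these queries: the unescaped dot matches any character, so a domain that merely ends in "nz" would also match. A minimal comparison in plain JavaScript; the second domain is a made-up example, the other two appear in these notes.

```javascript
const looseNz = /.nz$/;    // "." matches any character
const strictNz = /\.nz$/;  // "\." matches a literal dot only

// The middle domain is hypothetical, chosen to end in "nz" without a dot before it.
const testDomains = [
  "https://admin.teara.govt.nz",
  "http://www.example-benz",
  "https://follow3rs.com"
];

for (const domain of testDomains) {
  console.log(domain, looseNz.test(domain), strictNz.test(domain));
}
```

Whether the recorded counts would change with the escaped form depends on whether any such domain is in the collection.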
[33806] | 902 |
|
---|
[33807] | 903 | Inverse: the sites detected as containing at least 1 Maori language sentence that are NOT from NZ NOR have .nz domain ending (TLD):
|
---|
| 904 | db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}}, {geoLocationCountryCode: {$ne: "NZ"}}, {domain: {$not: /.nz$/}}]}).count()
|
---|
| 905 | 685
|
---|
[33806] | 906 |
|
---|
[33807] | 907 | The above two figures correctly add up to a total of 868 sites, which is the number of sites detected as containing at least 1 sentence in MRI:
|
---|
| 908 | db.getCollection('Websites').find({numPagesContainingMRI: {$gt: 0}}).count()
|
---|
| 909 | 868
|
---|
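The two counts add up because the second filter is the exact negation of the first: NOT (NZ OR .nz) is the same as (NOT NZ) AND (NOT .nz). A small check on hypothetical sample data in plain JavaScript; the domains and variable names are made up for illustration.

```javascript
// Hypothetical sample of sites with numPagesContainingMRI > 0.
const mriSites = [
  { domain: "https://a.example.nz", geoLocationCountryCode: "NZ" },
  { domain: "https://b.example.nz", geoLocationCountryCode: "US" },
  { domain: "https://c.example.com", geoLocationCountryCode: "NZ" },
  { domain: "https://d.example.com", geoLocationCountryCode: "DE" },
  { domain: "https://e.example.fr", geoLocationCountryCode: "FR" }
];

// Emulates the $or of the two conditions.
const nzOrNzTld = mriSites.filter(site =>
  site.geoLocationCountryCode === "NZ" || /\.nz$/.test(site.domain));

// Emulates $ne plus $not, combined with $and.
const neitherNz = mriSites.filter(site =>
  site.geoLocationCountryCode !== "NZ" && !/\.nz$/.test(site.domain));

console.log(nzOrNzTld.length, neitherNz.length, mriSites.length);
```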
[33806] | 910 |
|
---|
[33807] | 911 | Without those with /mi in path:
|
---|
| 912 | db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}}, {geoLocationCountryCode: {$ne: "NZ"}}, {domain: {$not: /.nz$/}}, {urlContainsLangCodeInPath: false}]}).count()
|
---|
[33806] | 913 |
|
---|
[33807] | 914 | Now let's get a listing of all 685 sites to be manually inspected to determine whether they're auto-translated:
|
---|
[33806] | 915 |
|
---|
[33807] | 916 | /*
|
---|
| 917 | db.Websites.aggregate([
|
---|
| 918 | {
|
---|
| 919 | $match: {
|
---|
| 920 | $and: [{numPagesContainingMRI: {$gt: 0}}, {geoLocationCountryCode: {$ne: "NZ"}}, {domain: {$not: /.nz$/}}, {urlContainsLangCodeInPath: false}]
|
---|
| 921 | }
|
---|
| 922 | },
|
---|
| 923 | { $unwind: "$geoLocationCountryCode" },
|
---|
| 924 | {
|
---|
| 925 | $group: {
|
---|
| 926 | _id: {$toLower: '$geoLocationCountryCode'},
|
---|
| 927 | count: { $sum: 1 },
|
---|
| 928 | domain: { $addToSet: '$domain' }
|
---|
| 929 | }
|
---|
| 930 | },
|
---|
| 931 | { $sort : { count : -1} }
|
---|
| 932 | ]);
|
---|
[33710] | 933 | */
|
---|
[33807] | 934 | db.Websites.aggregate([
|
---|
| 935 | {
|
---|
| 936 | $match: {
|
---|
[33813] | 937 | $and: [{numPagesContainingMRI: {$gt: 0}}, {geoLocationCountryCode: {$ne: "NZ"}}, {domain: {$not: /.nz$/}}, {urlContainsLangCodeInPath: {$ne: true}}]
|
---|
[33807] | 938 | }
|
---|
| 939 | },
|
---|
| 940 | { $unwind: "$geoLocationCountryCode" },
|
---|
| 941 | {
|
---|
| 942 | $group: {
|
---|
| 943 | _id: {$toLower: '$geoLocationCountryCode'},
|
---|
| 944 | count: { $sum: 1 },
|
---|
| 945 | domain: { $addToSet: '$domain' }
|
---|
| 946 | }
|
---|
| 947 | },
|
---|
| 948 | { $sort : { count : -1} }
|
---|
| 949 | ]);
|
---|
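The commented-out version matches urlContainsLangCodeInPath: false, while the active one uses {$ne: true}. These are not the same filter when the field can be absent: $ne also matches documents that do not contain the field. A sketch of the difference on hypothetical documents in plain JavaScript:

```javascript
// Hypothetical documents: the flag may be true, false, or never set.
const flagSamples = [
  { domain: "a.example", urlContainsLangCodeInPath: true },
  { domain: "b.example", urlContainsLangCodeInPath: false },
  { domain: "c.example" } // field missing
];

// Emulates {urlContainsLangCodeInPath: false}: only an explicit false matches.
const equalsFalse = flagSamples.filter(doc => doc.urlContainsLangCodeInPath === false);

// Emulates {urlContainsLangCodeInPath: {$ne: true}}: anything except true matches,
// including documents where the field is missing.
const notTrue = flagSamples.filter(doc => doc.urlContainsLangCodeInPath !== true);

console.log(equalsFalse.map(doc => doc.domain));
console.log(notTrue.map(doc => doc.domain));
```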
[33710] | 950 |
|
---|
[33813] | 951 |
|
---|
[33813] | 952 | We can knock off another 54 non-NZ sites with our new urlContainsLangCodeInPathPrefix field:
|
---|
| 953 |
|
---|
| 954 | db.getCollection('Websites').find({urlContainsLangCodeInPathPrefix: true, geoLocationCountryCode: {$ne: "NZ"}, domain: {$not: /.nz$/}}).count()
|
---|
| 955 | 54
|
---|
| 956 |
|
---|
| 957 |
|
---|
| 958 | SO, can repeat query with new field "urlContainsLangCodeInPathPrefix":
|
---|
| 959 | Number of sites containing >= 1 MRI sentences that are not from NZ or of .nz TLD and which don't contain "/mi(/)" or "http(s)://mi." in URL path:
|
---|
| 960 | db.getCollection('Websites').find({$and: [
|
---|
| 961 | {numPagesContainingMRI: {$gt: 0}},
|
---|
| 962 | {geoLocationCountryCode: {$ne: "NZ"}},
|
---|
| 963 | {domain: {$not: /.nz$/}},
|
---|
| 964 | {urlContainsLangCodeInPathSuffix: {$ne: true}},
|
---|
| 965 | {urlContainsLangCodeInPathPrefix: {$ne: true}}
|
---|
| 966 | ]}).count()
|
---|
| 967 |
|
---|
| 968 | 651
|
---|
| 969 |
|
---|
| 970 |
|
---|
| 971 | REDO THE COUNT BY COUNTRY QUERY FOR THIS:
|
---|
| 972 |
|
---|
| 973 | db.Websites.aggregate([
|
---|
| 974 | {
|
---|
| 975 | $match: {
|
---|
| 976 | $and: [{numPagesContainingMRI: {$gt: 0}}, {geoLocationCountryCode: {$ne: "NZ"}}, {domain: {$not: /.nz$/}}, {urlContainsLangCodeInPathSuffix: {$ne: true}}, {urlContainsLangCodeInPathPrefix: {$ne: true}}]
|
---|
| 977 | }
|
---|
| 978 | },
|
---|
| 979 | { $unwind: "$geoLocationCountryCode" },
|
---|
| 980 | {
|
---|
| 981 | $group: {
|
---|
| 982 | _id: {$toLower: '$geoLocationCountryCode'},
|
---|
| 983 | count: { $sum: 1 },
|
---|
| 984 | domain: { $addToSet: '$domain' }
|
---|
| 985 | }
|
---|
| 986 | },
|
---|
| 987 | { $sort : { count : -1} }
|
---|
| 988 | ]);
|
---|
| 989 |
|
---|
| 990 |
|
---|
[33896] | 991 | AFTER THE BUGFIX (miInURLPath is now set at the correct stage):
|
---|
[33813] | 992 | db.getCollection('Websites').find(
|
---|
| 993 | {$and: [
|
---|
| 994 | {numPagesContainingMRI: {$gt: 0}},
|
---|
| 995 | {geoLocationCountryCode: {$ne: "NZ"}},
|
---|
| 996 | {domain: {$not: /.nz$/}},
|
---|
| 997 | {urlContainsLangCodeInPath: {$ne: true}}
|
---|
| 998 | ]}).count()
|
---|
| 999 |
|
---|
| 1000 | 220
|
---|
| 1001 |
|
---|
| 1002 | db.Websites.aggregate([
|
---|
| 1003 | {
|
---|
| 1004 | $match: {
|
---|
| 1005 | $and: [
|
---|
| 1006 | {numPagesContainingMRI: {$gt: 0}},
|
---|
| 1007 | {geoLocationCountryCode: {$ne: "NZ"}},
|
---|
| 1008 | {domain: {$not: /.nz$/}},
|
---|
| 1009 | {urlContainsLangCodeInPath: {$ne: true}}
|
---|
| 1010 | ]
|
---|
| 1011 | }
|
---|
| 1012 | },
|
---|
| 1013 | { $unwind: "$geoLocationCountryCode" },
|
---|
| 1014 | {
|
---|
| 1015 | $group: {
|
---|
| 1016 | _id: {$toLower: '$geoLocationCountryCode'},
|
---|
| 1017 | count: { $sum: 1 },
|
---|
| 1018 | domain: { $addToSet: '$domain' }
|
---|
| 1019 | }
|
---|
| 1020 | },
|
---|
| 1021 | { $sort : { count : -1} }
|
---|
| 1022 | ]);
|
---|
| 1023 |
|
---|
[33896] | 1024 | Can inspect a website's pages to judge whether its content is relevant or auto-translated, as follows:
|
---|
[33813] | 1025 | db.getCollection('Webpages').find({URL:/svenkirsten.com/, mriSentenceCount: {$gt: 0}})
|
---|
| 1026 |
|
---|
| 1027 |
|
---|
[33807] | 1028 | * CN: Only 1/113 sites from CN stood out as being of interest: http://kiwi2china.com/
|
---|
| 1029 | BUT: it's auto-translated (e.g. Dutch is clearly auto-translated), MRI not in default or any visible drop down list, and the domain changes once you view it in Dutch to https://nl.admission.nz/
|
---|
[33710] | 1030 |
|
---|
[33816] | 1031 | * FR: 16 sites from FR
|
---|
| 1032 | http://blueheavenisland.com, http://www.blueheavenisland.com - misdetection. French Polynesia
|
---|
[33807] | 1033 | https://www.lexilogos.com/ -> takes me to NZ website MaoriDictionary.co.nz etc for translating words anyway
|
---|
| 1034 | http://kihikihi.fr/ -> travel (blog?). Appears to be Hawaiian related and not Maori.
|
---|
| 1035 | !! http://chantsdeluttes.free.fr/versionsinter/page%20maori.html -> Seems it may be a proper translation or composition, as Dutch and Flemish (and Groningense) versions are different songs by individual translators/composers
|
---|
| 1036 | http://splaf.free.fr/pfurb.html - Tahiti, French Polynesian, ... island names
|
---|
[33816] | 1037 | X http://mi.fitnessrebates.com - Uses https://wordpress.org/plugins/weglot/ wordpress-compatible multilingual plugin, which ensures translated pages get indexed by google - exactly what we want to avoid
|
---|
| 1038 | http://mahajana.net - misdetected a Japanese Zen Buddhist chant as MRI
|
---|
| 1039 | http://rapanui.fr - Rapa Nui Easter Island. Misdetected.
|
---|
| 1040 | http://www.gif.ovh - autotranslated pages. Supposedly a GIF repository
|
---|
| 1041 | http://baladeornithologique.com - misdetection of the word "Retour"
|
---|
| 1042 | http://www.gaudry.be - misdetection of Japanese hiragana etc, and French "faire", as MRI
|
---|
| 1043 | http://www.gototahiti.net - probably misdetection, see title
|
---|
| 1044 | http://www.maraamusurfskirace.com - Bora Bora, French Polynesia. Misdetected.
|
---|
| 1045 | http://www.rongo-rongo.com - appears to be related to Easter Island. Just 1 sentence however.
|
---|
| 1046 | http://pt.city-usa.net - misdetection. Hawaii.
|
---|
| 1047 | https://www.manualscat.com - Misdetection. Appears to be in German. Manuals pages.
|
---|
| 1048 | NL:
|
---|
[33823] | 1049 | (!!!) - http://www.gouvernante.info and http://gouvernante.info - radio links to NZ websites not found by commoncrawl and which potentially have Maori language content. For example, http://irirangi.net/, https://www.atiawatoafm.com, www.maori.org.nz [http://www.gouvernante.info/radio4.htm]
|
---|
[33816] | 1050 | - https://www.arrowhead.eu, https://arrowheadproject.azurewebsites.net, arrowhead.eu - misidentification of URL
|
---|
| 1051 | - tonhut.nl - misidentification
|
---|
| 1052 | ? http://nielsonboutique.co.uk, http://longhornlaw.net, http://tetsubo.org, http://hidsonphoto.com, http://wearehomework.com/- Feels autotranslated, but no language options visible. All SEO related
|
---|
| 1053 | - diverosa.com - Rapa Nui, Easter Island
|
---|
| 1054 | - nonlinear.demon.nl - misidentified
|
---|
| 1055 | - encyclo.co.uk - misidentification
|
---|
| 1056 | - henrifloor.nl - misidentification
|
---|
| 1057 | - http://skimap.info/ - maps, NZ placenames in PDF
|
---|
| 1058 | DK:
|
---|
[33823] | 1059 | !! ++ http://akona.ngapuhitelevision.com, http://waiatarangatiratanga.ngapuhitelevision.com,
|
---|
[33816] | 1060 | http://jazz.ngapuhitelevision.com, http://ngapuhitelevision.com, http://ngapuhiradio.com,
|
---|
| 1061 | http://powhiri.ngapuhitelevision.com, http://komisch.ngapuhitelevision.com
|
---|
| 1062 | - http://www.rennertweb.de - a photogallery page mentioning NZ placenames
|
---|
| 1063 | CA:
|
---|
| 1064 | - http://bcmarina.com AND http://bckayak.com - photos with Canadian placenames
|
---|
| 1065 | - http://www.myrasplace.net - pages of photos, captions involving NZ placenames
|
---|
| 1066 | ~ http://00.gs/Maniapoto;Uriwera;Moriori;Hivaoa;Kumulipo.htm - Maori-Polynesian comparative dictionary words listing
|
---|
| 1067 | - aguadilla.airport-authority.com - misidentification
|
---|
| 1068 | - https://articles.imperialtometric.com - misidentification
|
---|
| 1069 | - http://daandehn.com - no more than 1 sentence over multiple files. Appears to be photo captions of NZ placenames
|
---|
[33813] | 1070 | DE:
|
---|
[33816] | 1071 | - http://etymologie.info/~e/n_/nz-___reg.html - placenames, not meaningful
|
---|
| 1072 | !! https://www.cartogiraffe.com/ and https://www.cartogiraffe.com - some genuine pages (Rarotongan), but one page is in Czech that had a single word misidentified as MRI
|
---|
[33813] | 1073 | ~ http://svenkirsten.com/ - one page mentions "tiki" but the rest is in English. The other is an (English) caption of "Book of Tiki A Maori Maiden"
|
---|
| 1074 | - herocity - autotranslated
|
---|
| 1075 | - weltderberge.de - 3 pages mention NZ mountains by name.
|
---|
| 1076 | ~ (arts.mythologica.fr) https://mythologica.fr/oceanie/texte/pantheon_polynesien.pdf - mentions certain Maori Gods and other Polynesian Gods by name.
|
---|
| 1077 | - https://traynews.com - nothing in MRI, misdetected
|
---|
| 1078 | ~ http://klaaskoehne.de/galleries/nzl-taranaki/index.html - mentions NZ mountain names
|
---|
| 1079 | - http://www.nierstrasz.org/deGrauwRegister.rtf - misdetected European (Dutch) names as MRI
|
---|
[33816] | 1080 | X https://afrikhepri.org/mi/ - autotranslated
|
---|
[33813] | 1081 | - https://www.tvteile.de - pure German pages, misdetected "Automatik" as a Maori language word
|
---|
| 1082 | - etoile-de-lune.net - 5 pages containing 1 sentence each but none with 2 sentences detected
|
---|
[33816] | 1083 | - https://www.you-fly.com - misdetection of German "Warum?" as MRI
|
---|
| 1084 | - http://vulkane.ch - misdetected pages on Hawaiian volcanoes.
|
---|
| 1085 | - http://www.stephe.de - photos from NZ captioned with NZ placenames
|
---|
| 1086 | - http://insecta.pro - misdetection
|
---|
| 1087 | - http://m.distanta.1km.net - NZ placenames. Lots of distances mentioning Waitangi. Nothing detected as containing more than 1 sentence.
|
---|
| 1088 | - https://ersatzteile-fachversand.de - German misdetected as Maori.
|
---|
| 1089 | - https://laskar02cinta.page.tl/Info.htm - seems like a junk site with a random sentence autotranslated into many different languages. So one sentence possibly in Maori, but may not make sense.
|
---|
| 1090 | - http://www.behlig.de - misdetection. Photos from Hawaii.
|
---|
| 1091 | !! http://www.udhr.de - Universal Declaration of Human Rights. (Also on a Bulgarian site). Multiple translations available.
|
---|
[33813] | 1092 | - ITALY:
|
---|
| 1093 | http://oipaz.net/IMG/GalleriaAotearoa/ - NZ photogallery with each photo captioned by placename
|
---|
| 1094 | http://www.marcosanti.it/Reportage/Oceania_ph/Nuova_Zelanda/ - each photo captioned by NZ placename
|
---|
| 1095 | http://www.pegasoesmicamion.com/ - REO abbreviation misidentified, also in REO%20PUBLICIDAD.htm
|
---|
| 1096 | - AUSTRIA:
|
---|
| 1097 | petit-prince.at - Tahitian and Wayuu (Venezuela) translations of Le Petit Prince
|
---|
| 1098 | http://www.tmtmm.net/newzealand - photos from NZ named after places and people's names
|
---|
| 1099 | - ROMANIA: parohiauceadesus.ro - Sentences of single Romanian words misidentified.
|
---|
| 1100 | - ISRAEL:
|
---|
| 1101 | http://www.daat.ac.il - misidentification of "no." as MRI, and Hebrew words.
|
---|
| 1102 | https://www.hitiaotera.com/ - misidentification of Tahitian pages
|
---|
| 1103 | - RUSSIA: https://www.gismeteo.lv - misidentification of an email address
|
---|
| 1104 | - JAPAN: http://yutaka.it-n.jp - many pages of scientific names of (plants?) which are often misdetected as MRI
|
---|
[33816] | 1105 | !! - Ireland, ie: https://coggle.it
|
---|
[33813] | 1106 | - IRAN: https://www.dideo.ir/v/yt/d6cgya0ze-E - video title from MaoriTelevision website
|
---|
[33816] | 1107 | - CZECH republic:
|
---|
| 1108 | ? https://www.fipojobs.com/new-zealand/jobs-work-p-1 - NZ job position title in MRI but rest in English
|
---|
| 1109 | !! http://www.henryklahola.nazory.cz/094.Maori.htm and http://henryklahola.nazory.cz variant
|
---|
| 1110 | http://about.ilikeyou.com - dating site. Misidentification.
|
---|
| 1111 | - SPAIN:
|
---|
| 1112 | !! https://www.uv.es/~pla/red.net/intmaori.html
|
---|
| 1113 | https://www.reclamaciondevuelos.com - 2 occurrences of the word "kiwi"
|
---|
| 1114 | http://www.info-hoteles.com/nz/2/hotels_lake_rotoiti.asp - 2 uses of the same placename
|
---|
| 1115 | http://www.cruceros-princess.mx/princessMX/Oferta_Cruzeiros_Polinesia.html - Polynesian placenames
|
---|
[33813] | 1116 | - SINGAPORE: https://omg-solutions.com - autotranslated
|
---|
| 1117 | - TURKEY: https://www.elitedeluxe.com.tr/mi/yatak-odasi-takimlari - autotranslated
|
---|
| 1118 | - MEXICO: http://www.gelbukh.com - misidentification, lines of just numbers or phrases like "Area Chair" in English and Russian CVs.
|
---|
| 1119 | - FINLAND: http://pertti.com - travelogue, placenames
|
---|
| 1120 | - SWITZERLAND CH:
|
---|
| 1121 | nicoledidi.ch - blog, placenames
|
---|
| 1122 | https://photos.axelebert.org - Tahiti related content
|
---|
| 1123 | - UNKNOWN: https://www.viveipcl.com: tours website, placenames mentioned
|
---|
| 1124 | #- EU: https://www.the-good-stuff-factory.be/mi/ : Autotranslated
|
---|
| 1125 | !! - BULGARIA: http://anitra.net/activism/humanrights/UDHR/rrt_print.htm (2 pages)
|
---|
| 1126 |
|
---|
| 1127 |
|
---|
| 1128 | TREATING AUSTRALIA AND GREAT BRITAIN MORE SPECIALLY (don't ignore /mi in URL, same as with NZ, but do leave out .nz TLDs since we cover them under NZ - TODO: later find country codes of all .nz TLDs):
|
---|
| 1129 | [nothing found under "UK", only under "GB"]
|
---|
| 1130 |
|
---|
| 1131 | db.getCollection('Websites').find({
|
---|
| 1132 | domain: {$not: /.nz$/},
|
---|
| 1133 | numPagesContainingMRI: {$gt: 0},
|
---|
| 1134 | $or: [{geoLocationCountryCode: "AU"}, {geoLocationCountryCode: "GB"}]
|
---|
| 1135 | }).count()
|
---|
| 1136 | 11
|
---|
| 1137 |
|
---|
| 1138 | db.Websites.aggregate([
|
---|
| 1139 | {
|
---|
| 1140 | $match: {
|
---|
| 1141 | domain: {$not: /.nz$/},
|
---|
| 1142 | numPagesContainingMRI: {$gt: 0},
|
---|
| 1143 | $or: [{geoLocationCountryCode: "AU"}, {geoLocationCountryCode: "GB"}]
|
---|
| 1144 | }
|
---|
| 1145 | },
|
---|
| 1146 | { $unwind: "$geoLocationCountryCode" },
|
---|
| 1147 | {
|
---|
| 1148 | $group: {
|
---|
| 1149 | _id: {$toLower: '$geoLocationCountryCode'},
|
---|
| 1150 | count: { $sum: 1 },
|
---|
| 1151 | domain: { $addToSet: '$domain' }
|
---|
| 1152 | }
|
---|
| 1153 | },
|
---|
| 1154 | { $sort : { count : -1} }
|
---|
| 1155 | ]);
|
---|
| 1156 |
|
---|
| 1157 | AUSTRALIA:
|
---|
| 1158 | !! https://www.kiwiproperty.com - e.g. https://www.kiwiproperty.com/the-base/mi/he-paepaki/ has some actual MRI sentences. [Not autotranslated]
|
---|
| 1159 | ? http://fionajack.net - Wellington gallery of artist. A few occurrences of Kia Ora in a title like context (e.g. "Street Party Kia Ora! Kia Ora!")
|
---|
[33849] | 1160 | X!! https://infogram.com/te-marautanga-o-aotearoa-moe-pld-allocations-2012-1go502ygvn562jd - site of individual pages (like docs.google.com). This one has a relevant infogram image. But it's English with MRI in the image legend and captions.
|
---|
[33813] | 1161 | !! https://koreromaori.com - some actual Maori language sentences
|
---|
| 1162 | http://theunderwaterworld.com/Galleries/Roimata/roimata-frame.html - placenames
|
---|
| 1163 |
|
---|
| 1164 | UK:
|
---|
| 1165 | http://www.wordsearchfun.com/200628_Word_Find_wordsearch.html - 2 word games with Maori words (one of them has 3 different views, e.g. print view)
|
---|
| 1166 | ? https://omniatlas.com/maps/australasia/18400206/plain/ - historical map with Maori iwi names over NZ map regions
|
---|
| 1167 | ? https://omniatlas.com/maps/australasia/18400206/ - historical map of Australia and NZ at the time of the Treaty of Waitangi, with events marked in English
|
---|
| 1168 | https://centrallanguageschool.com - AUTOTRANSLATED
|
---|
| 1169 | https://www.solasolv.com - Autotranslated product site
|
---|
| 1170 | http://mikestephens.co.uk/ - photo captions containing NZ placenames
|
---|
| 1171 | http://www.woolrych.org/nzholiday2004/ - photogallery captioned with NZ placenames
|
---|
[33816] | 1172 |
|
---|
[33807] | 1173 | --------------
|
---|
[33710] | 1174 |
|
---|
[33807] | 1175 | GETTING TABLE DATA OUT OF MONGO DB:
|
---|
[33710] | 1176 |
|
---|
[33807] | 1177 | https://stackoverflow.com/questions/28733692/how-to-export-json-from-mongodb-using-robomongo
|
---|
| 1178 | "export to file" as in a spreadsheet like to a .csv?
|
---|
[33710] | 1179 |
|
---|
[33807] | 1180 | IMO this is the EASIEST way to do this in Robo 3T (formerly robomongo):
|
---|
[33710] | 1181 |
|
---|
[33807] | 1182 | 1. In the top right of the Robo 3T GUI there is a "View Results in text mode" button, click it and copy everything
|
---|
[33710] | 1183 |
|
---|
[33807] | 1184 | 2. paste everything into this website: https://json-csv.com/
|
---|
[33710] | 1185 |
|
---|
[33807] | 1186 | 3. click the download button and now you have it in a spreadsheet.
|
---|
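The same conversion can be done locally instead of through a website. A minimal sketch in plain JavaScript, assuming the copied results have already been turned into a valid JSON array of flat documents. The function name docsToCsv and the sample records are illustrative only, not part of the original workflow:

```javascript
// Convert an array of flat JSON documents to CSV text.
// Assumption: values are strings, numbers, booleans or arrays (no nested objects).
function docsToCsv(docs) {
  // Collect the union of all field names, in first-seen order.
  const fields = [];
  for (const doc of docs) {
    for (const key of Object.keys(doc)) {
      if (!fields.includes(key)) fields.push(key);
    }
  }
  // Quote every cell and double any embedded quotes.
  const cell = (value) => {
    if (value === undefined || value === null) return '""';
    const text = Array.isArray(value) ? value.join(";") : String(value);
    return '"' + text.replace(/"/g, '""') + '"';
  };
  const rows = [fields.map(cell).join(",")];
  for (const doc of docs) {
    rows.push(fields.map((f) => cell(doc[f])).join(","));
  }
  return rows.join("\n");
}

// Hypothetical sample shaped like the aggregation output in these notes.
const sample = [
  { _id: "us", count: 2, domain: ["example.com", "example.org"] },
  { _id: "au", count: 1, domain: ["example.net"] },
];
const csv = docsToCsv(sample);
console.log(csv);
```

Array values (such as the domain sets produced by $addToSet) are joined with ";" so that each document stays on one CSV row.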
[33710] | 1187 |
|
---|
| 1188 |
|
---|
[33807] | 1189 | https://json-csv.com/
|
---|
[33710] | 1190 |
|
---|
| 1191 |
|
---|
[33807] | 1192 | ---------------------
|
---|
[33813] | 1193 |
|
---|
| 1194 | Count of websites that have at least 1 page containing at least one sentence detected as MRI
|
---|
| 1195 | AND which websites have mi in the URL path:
|
---|
| 1196 |
|
---|
| 1197 | db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}]}).count()
|
---|
| 1198 |
|
---|
| 1199 | 491
|
---|
| 1200 |
|
---|
| 1201 |
|
---|
| 1202 |
|
---|
| 1203 | # The websites that have some MRI detected AND which are either in NZ, or have an .nz TLD,
|
---|
| 1204 | # or (if they're from overseas) don't contain /mi or mi.* in the URL path:
|
---|
| 1205 |
|
---|
| 1206 | db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{$or: [{geoLocationCountryCode: "NZ"}, {domain: /\.nz$/}, {urlContainsLangCodeInPath: false}]}]}).count()
|
---|
| 1207 | 396
|
---|
| 1208 |
|
---|
| 1209 | Include Australia (to get the valid "kiwiproperty.com" website included in the result list):
|
---|
| 1210 |
|
---|
| 1211 | db.getCollection('Websites').find({$and: [
|
---|
| 1212 | {numPagesContainingMRI: {$gt: 0}},
|
---|
| 1213 | {$or: [{geoLocationCountryCode: /(NZ|AU)/}, {domain: /\.nz$/}, {urlContainsLangCodeInPath: false}]}
|
---|
| 1214 | ]}).count()
|
---|
| 1215 |
|
---|
| 1216 | 397
|
---|
| 1217 |
|
---|
| 1218 | # aggregate results by a count of country codes
|
---|
| 1219 | db.Websites.aggregate([
|
---|
| 1220 | {
|
---|
| 1221 | $match: {
|
---|
| 1222 | $and: [
|
---|
| 1223 | {numPagesContainingMRI: {$gt: 0}},
|
---|
| 1224 | {$or: [{geoLocationCountryCode: /(NZ|AU)/}, {domain: /\.nz$/}, {urlContainsLangCodeInPath: false}]}
|
---|
| 1225 | ]
|
---|
| 1226 | }
|
---|
| 1227 | },
|
---|
| 1228 | { $unwind: "$geoLocationCountryCode" },
|
---|
| 1229 | {
|
---|
| 1230 | $group: {
|
---|
| 1231 | _id: {$toLower: '$geoLocationCountryCode'},
|
---|
| 1232 | count: { $sum: 1 }
|
---|
| 1233 | }
|
---|
| 1234 | },
|
---|
| 1235 | { $sort : { count : -1} }
|
---|
| 1236 | ]);
|
---|
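For anyone less familiar with the aggregation pipeline, the $match / $group / $sort stages above behave roughly like the following plain JavaScript over an in-memory array. This is only a sketch: the sample documents are made up, and the $unwind stage is left out on the assumption that geoLocationCountryCode holds a single string per website.

```javascript
// Hypothetical sample documents using the field names from the Websites collection.
const websites = [
  { domain: "a.example.nz", geoLocationCountryCode: "US", numPagesContainingMRI: 3, urlContainsLangCodeInPath: true },
  { domain: "b.example.com", geoLocationCountryCode: "NZ", numPagesContainingMRI: 1, urlContainsLangCodeInPath: false },
  { domain: "c.example.com", geoLocationCountryCode: "AU", numPagesContainingMRI: 2, urlContainsLangCodeInPath: true },
  { domain: "d.example.com", geoLocationCountryCode: "US", numPagesContainingMRI: 5, urlContainsLangCodeInPath: false },
  { domain: "e.example.com", geoLocationCountryCode: "US", numPagesContainingMRI: 4, urlContainsLangCodeInPath: true },
  { domain: "f.example.com", geoLocationCountryCode: "NZ", numPagesContainingMRI: 0, urlContainsLangCodeInPath: false },
];

// $match: some MRI detected AND (in NZ/AU OR .nz domain OR no language code in the URL path).
const matched = websites.filter((w) =>
  w.numPagesContainingMRI > 0 &&
  (/(NZ|AU)/.test(w.geoLocationCountryCode) ||
    /\.nz$/.test(w.domain) ||
    w.urlContainsLangCodeInPath === false)
);

// $group: count websites per lower-cased country code.
const counts = new Map();
for (const w of matched) {
  const key = w.geoLocationCountryCode.toLowerCase();
  counts.set(key, (counts.get(key) || 0) + 1);
}

// $sort: descending by count.
const result = [...counts.entries()]
  .map(([_id, count]) => ({ _id, count }))
  .sort((x, y) => y.count - x.count);

console.log(result);
```

In this sample, e.example.com is dropped because it is overseas and has a language code in its URL path, and f.example.com is dropped because no MRI was detected on it.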
| 1237 |
|
---|
| 1238 |
|
---|
| 1239 | # Just considering those sites outside NZ or not with .nz TLD:
|
---|
| 1240 | db.Websites.aggregate([
|
---|
| 1241 | {
|
---|
| 1242 | $match: {
|
---|
| 1243 | $and: [
|
---|
| 1244 | {geoLocationCountryCode: {$ne: "NZ"}},
|
---|
| 1245 | {domain: {$not: /\.nz/}},
|
---|
| 1246 | {numPagesContainingMRI: {$gt: 0}},
|
---|
| 1247 | {$or: [{geoLocationCountryCode: "AU"}, {urlContainsLangCodeInPath: false}]}
|
---|
| 1248 | ]
|
---|
| 1249 | }
|
---|
| 1250 | },
|
---|
| 1251 | { $unwind: "$geoLocationCountryCode" },
|
---|
| 1252 | {
|
---|
| 1253 | $group: {
|
---|
| 1254 | _id: {$toLower: '$geoLocationCountryCode'},
|
---|
| 1255 | count: { $sum: 1 },
|
---|
| 1256 | domain: { $addToSet: '$domain' }
|
---|
| 1257 | }
|
---|
| 1258 | },
|
---|
| 1259 | { $sort : { count : -1} }
|
---|
| 1260 | ]);
|
---|
| 1261 |
|
---|
| 1262 |
|
---|
[33823] | 1263 | # total count excluding NZ related sites
|
---|
| 1264 | db.getCollection('Websites').find({$and: [
|
---|
| 1265 | {geoLocationCountryCode: {$ne: "NZ"}},
|
---|
| 1266 | {domain: {$not: /\.nz/}},
|
---|
| 1267 | {numPagesContainingMRI: {$gt: 0}},
|
---|
| 1268 | {$or: [{geoLocationCountryCode: "AU"}, {urlContainsLangCodeInPath: false}]}
|
---|
| 1269 | ]}).count()
|
---|
| 1270 |
|
---|
| 1271 | 221 websites
|
---|
| 1272 |
|
---|
| 1273 |
|
---|
[33813] | 1274 | # But to produce the tentative non-product sites, we also want the aggregate for all NZ sites (from NZ or with .nz tld):
|
---|
| 1275 | db.getCollection('Websites').find({$and: [
|
---|
| 1276 | {numPagesContainingMRI: {$gt: 0}},
|
---|
| 1277 | {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]}
|
---|
| 1278 | ]}).count()
|
---|
| 1279 |
|
---|
| 1280 | 176
|
---|
| 1281 |
|
---|
| 1282 | (Total is 221+176 = 397, which adds up).
|
---|
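The two counts add up because the NZ filter and the overseas filter are exact negations of each other within the same set of sites, so every site lands in exactly one group. A small sketch of that check in plain JavaScript, with invented sample sites and the anchored /\.nz$/ form of the domain test:

```javascript
// Hypothetical sample: only the two fields needed for the NZ / non-NZ split.
const sites = [
  { domain: "a.example.nz", geoLocationCountryCode: "US" },
  { domain: "b.example.com", geoLocationCountryCode: "NZ" },
  { domain: "c.example.com", geoLocationCountryCode: "AU" },
  { domain: "d.example.org", geoLocationCountryCode: "DE" },
];

// "NZ related": located in NZ or on a .nz domain.
const isNzRelated = (s) =>
  s.geoLocationCountryCode === "NZ" || /\.nz$/.test(s.domain);

const nzSites = sites.filter(isNzRelated);
const overseasSites = sites.filter((s) => !isNzRelated(s));

// The second filter is the exact negation of the first,
// so the two counts must add up to the size of the whole set.
console.log(nzSites.length, overseasSites.length, sites.length);
```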
| 1283 |
|
---|
| 1284 | # Get the count (and domain listing) output grouped under a hardcoded _id of "nz":
|
---|
| 1285 | db.Websites.aggregate([
|
---|
| 1286 | {
|
---|
| 1287 | $match: {
|
---|
| 1288 | $and: [
|
---|
| 1289 | {numPagesContainingMRI: {$gt: 0}},
|
---|
| 1290 | {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]}
|
---|
| 1291 | ]
|
---|
| 1292 | }
|
---|
| 1293 | },
|
---|
| 1294 | { $unwind: "$geoLocationCountryCode" },
|
---|
| 1295 | {
|
---|
| 1296 | $group: {
|
---|
| 1297 | _id: "nz",
|
---|
| 1298 | count: { $sum: 1 },
|
---|
| 1299 | domain: { $addToSet: '$domain' }
|
---|
| 1300 | }
|
---|
| 1301 | },
|
---|
| 1302 | { $sort : { count : -1} }
|
---|
| 1303 | ]);
|
---|
[33816] | 1304 |
|
---|
| 1305 |
|
---|
| 1306 | -----------------------
|
---|
[33823] | 1307 | US:
|
---|
[33816] | 1308 | Done: manually inspected 68/117 sites
|
---|
| 1309 |
|
---|
[33823] | 1310 | TOTAL US: 4+7+7+4+3=25
|
---|
| 1311 |
|
---|
[33816] | 1312 | DEFINITELY:
|
---|
| 1313 | + http://anglicanhistory.org,
|
---|
| 1314 | + http://www.unicode.org, [Universal declaration of Human Rights]
|
---|
| 1315 | + https://static-promote.weebly.com,
|
---|
[33849] | 1316 | + http://aclhokiangarocks.blogspot.com, [often English, but COMMUNITY. At least short or partial MRI sentences.]
|
---|
[33816] | 1317 |
|
---|
| 1318 | BIBLE/MOHAMMED/BAHAI TRANSLATIONS probably not auto translations:
|
---|
| 1319 | + http://bahaiprayers.net, [Dutch seems to be properly translated, not auto-translated, so maybe MRI too]
|
---|
| 1320 | + https://biblehub.com,
|
---|
| 1321 | + http://www.muhammad.com, [possibly not autotranslated]
|
---|
| 1322 | + http://www.godrules.net, [possibly not autotranslated]
|
---|
| 1323 | + http://m.biblepub.com,
|
---|
| 1324 | + http://www.krassotkin.ru, [probably real translations, as there are multiple Dutch translations from different sources provided]
|
---|
| 1325 | + http://www.gotquestions.org, [doesn't appear autotranslated]
|
---|
| 1326 | X https://ebible.org, [Hiri Motu, PNG language misdetected. Doesn't seem to have Maori]
|
---|
| 1327 | X https://www.bible.com, doesn't have Maori translation. Misdetected.
|
---|
| 1328 | X https://wol.jw.org, - doesn't have Maori translations. Instead, Rongo-rongo, Kiribati (Micronesian) etc misdetected
|
---|
| 1329 | X https://png.bible, [misdetected, Papua New Guinea]
|
---|
| 1330 | X http://www.precious-testimonies.com, http://precious-testimonies.com/JesusDidItTranslations/JesusDidItMaoriTranslation.htm may be autotranslated as the Dutch page looks more like Danish or some Scandinavian language and the French page is missing accented characters.
|
---|
| 1331 |
|
---|
[33896] | 1332 | CHECK, PROBABLY HAS MRI - PROCESSED:
|
---|
[33816] | 1333 | !! https://maorinews.com,
|
---|
| 1334 | !! http://maaori.com,
|
---|
[33849] | 1335 | !!+ http://kiaorahola.blogspot.com,
|
---|
[33816] | 1336 | + https://kjohnsonnz.blogspot.com,
|
---|
| 1337 | + http://pumanawawhangara.blogspot.com,
|
---|
| 1338 | + http://dannykahei.tripod.com,
|
---|
[33849] | 1339 | + http://burkekm001.tripod.com,
|
---|
[33816] | 1340 | + http://tkkpipipaopao.blogspot.com,
|
---|
| 1341 | + http://manateina.blogspot.com,
|
---|
| 1342 | ? tkkpipipaopao.blogspot.com? http://rangiwewehi.com, [English, but community]
|
---|
| 1343 | ? https://www.terakau.org, [COMMUNITY, but English]
|
---|
| 1344 | ? https://www.pipirikiapapatuanuku.org, [COMMUNITY?, in English, environment site]
|
---|
| 1345 | ~ http://georgegi.tripod.com,
|
---|
| 1346 | ~ http://ngarangatahi.tripod.com, [1 page, image caption, Maori language warden position title with English sentence for appointment as warden]
|
---|
| 1347 | X http://fhr.kiwicelts.com,
|
---|
| 1348 | X http://tkrow.tripod.com, [English, background of NZ place]
|
---|
| 1349 | X http://www.mkiwi.com, - placenames
|
---|
| 1350 | X http://www.waimate.com, [English, NZ place]
|
---|
| 1351 |
|
---|
[33896] | 1352 | MAYBE HAS MRI, INSPECT - PROCESSED:
|
---|
[33816] | 1353 | ? https://www.natekore2018.com, [lots of English, but COMMUNITY, CULTURE]
|
---|
| 1354 | + http://tatai09.blogspot.com,
|
---|
| 1355 | + http://www.twttoa.com,
|
---|
| 1356 | + http://tuhua2010.blogspot.com,
|
---|
| 1357 | X http://www.huapala.org, [misdetected, Hawaiian]
|
---|
| 1358 | X https://www.vaihaunui.net, [misdetected, Tahiti]
|
---|
| 1359 | X https://www.kaifineart.com, [art site by different artists. A Chinese and another (possibly Japanese) name were misdetected]
|
---|
| 1360 | X http://mahoraroom8.blogspot.com, [NZ school, but main page mostly in English. No pages with > 1 sentence detected as MRI]
|
---|
| 1361 | + http://piripi.blogspot.com,
|
---|
| 1362 | X http://www.hiroa.pf, [misdetected. Crawled content appears Polynesian not Maori]
|
---|
| 1363 | X http://korora.econ.yale.edu, [NZ place photo caption]
|
---|
| 1364 | X https://www.poehalisnami.ua, [mostly Cyrillic, with some NZ or Polynesian names misdetected]
|
---|
| 1365 | X http://hannas-reiseblog.blogspot.com - one page contained NZ placenames, another had a word misdetected
|
---|
| 1366 |
|
---|
| 1367 |
|
---|
| 1368 | + https://www.breaker.audio, [audio, with occasional English.]
|
---|
| 1369 | ? https://livestream.com, [video and audio, seems in English, but maybe CULTURAL/COMMUNITY?]
|
---|
| 1370 |
|
---|
| 1371 | X https://docs.google.com, timetable with occasional Maori language word
|
---|
| 1372 | + https://drive.google.com, https://drive.google.com/file/d/1NwuzafjddaP8gxI7O_Zapts5bM7mrtwn/preview is an image of Maori number names. But the other page on drive.google.com is an NZ certificate or ID (in English) of a person's position.
|
---|
| 1373 | http://ritusehji.blogspot.com - no page with more than 1 sentence detected. But short string of actual MRI content. Educator blog with pictures and English language content.
|
---|
| 1374 |
|
---|
| 1375 |
|
---|
| 1376 | PINTEREST
|
---|
| 1377 | + https://in.pinterest.com/pin/317363104978423418/
|
---|
| 1378 | "karakia mo te moana - Google Search | Te Reo Maori Resources | Moana, Powerpoint tips, Google"
|
---|
| 1379 | ? https://za.pinterest.com/pin/524669425310419500/
|
---|
| 1380 | Maori Moko | Image | Moko Maori Tattoo & Portraits | TA MOKO | Maori tribe, Maori people, Maori art [COMMUNITY, CULTURE]
|
---|
| 1381 | [The other pinterest detected as numPagesContainingMRI > 0 was misdetected]
|
---|
| 1382 |
|
---|
| 1383 | https://nl.pinterest.com,
|
---|
| 1384 | https://www.pinterest.jp,
|
---|
| 1385 | https://www.pinterest.it,
|
---|
| 1386 | https://www.pinterest.co.uk,
|
---|
| 1387 | https://www.pinterest.ca,
|
---|
| 1388 | https://za.pinterest.com,
|
---|
| 1389 | https://www.pinterest.fr,
|
---|
| 1390 | https://in.pinterest.com,
|
---|
| 1391 |
|
---|
| 1392 | MORE BLOGSPOTS
|
---|
| 1393 | X http://word-dialect.blogspot.com, [Indonesian, misdetected]
|
---|
| 1394 | ~ http://atopeconlostopes.blogspot.com, [title on page appears to be in MRI, but content appears to be in English and South/Central American. Internationally focussed content.]
|
---|
| 1395 | X http://lianzaconference2012.blogspot.com, [NZ placename or institution]
|
---|
| 1396 | ? http://mrshamiltonskoolkidz.blogspot.com, [te reo Maori related school activities. Described in English.]
|
---|
| 1397 | X http://capsuraotearoa.blogspot.com, [blog in French, photo captions contain NZ placenames]
|
---|
| 1398 | X http://blogdepasopor.blogspot.com, [blog in French, Rapa Nui/Easter Island related content, misdetected.]
|
---|
| 1399 |
|
---|
| 1400 |
|
---|
| 1401 | UNLIKELY
|
---|
| 1402 | ?? http://naturalfatburner.net, http://naturalfatburner.net/NoNonsenseTed/fatloss-mao/ feels like it's autotranslated, an image of text appears, but the text is in MRI [advertising for some weight loss gimmick]
|
---|
| 1403 |
|
---|
| 1404 |
|
---|
| 1405 | BLACKLIST:
|
---|
| 1406 | X http://ww25.milfsplease.com,
|
---|
| 1407 | X http://www.the-naked.com
|
---|
| 1408 |
|
---|
| 1409 | OTHER:
|
---|
| 1410 | X http://seapixonline.com, https://www.seapixonline.com, [photo captions of ships. Sometimes misdetected Japanese words as MRI.]
|
---|
| 1411 | X http://www.code-postal.com, https://www.code-postal.com, [not more than 1 sentence detected as in MRI]
|
---|
| 1412 | X https://www.dbnames.net, [Name database, lots misdetected]
|
---|
| 1413 |
|
---|
[33896] | 1414 | STILL TO DO LIST - PROCESSED:
|
---|
[33816] | 1415 |
|
---|
| 1416 | X https://www.myadsclassified.com, [misdetected 3 short English sentences as MRI]
|
---|
| 1417 | X http://www.whoisthatr.com, [misdetected short English sentence as MRI]
|
---|
| 1418 | X https://www.oemsec.com, [autotranslated product site]
|
---|
| 1419 | X http://svenskadress.net, [linkfarm like site of related junk links, contained URLs misdetected as MRI]
|
---|
| 1420 |
|
---|
| 1421 | X https://www.webwiki.com, [contains URLs. URLs containing Aotearoa as substring detected as MRI. But no proper sentence content. ]
|
---|
| 1422 | X http://mikebonnice.com, [Hawaiian and Tahiti related content misdetected]
|
---|
| 1423 | X http://www.hudl.com, [misdetected short English sentence as MRI]
|
---|
| 1424 | X http://www.wikitree.com, [misdetected short English sentence as MRI]
|
---|
| 1425 | X http://shuttersportnelson.photoshelter.com, [image captions of "Wairua Warrior"]
|
---|
| 1426 |
|
---|
| 1427 | X http://niken8media.logdown.com, [Poker website? Looks autotranslated or Lorem Ipsum type of meaningless sentences.]
|
---|
| 1428 | X https://www.podrozeady.com, Looks Polish or other East-European language. The NZ page https://www.podrozeady.com/NZ/4/ had placenames detected.
|
---|
| 1429 |
|
---|
| 1430 | X http://www.thesalmons.org, [detection and misdetection of author names of papers hosted]
|
---|
| 1431 |
|
---|
| 1432 | X http://linkvip.top, [.rar and media file links misdetected as MRI]
|
---|
| 1433 |
|
---|
| 1434 |
|
---|
| 1435 | X http://www.lunar-occultations.com, [NZ place names for astronomical phenomena]
|
---|
| 1436 | X http://shangrilapress.net, [NZ placenames]
|
---|
| 1437 | X http://malecek.com, [misdetection CD title]
|
---|
| 1438 | X https://www.blue-frontiers.com, [Tahitian, Reo Tahiti, misdetected as MRI]
|
---|
| 1439 | X http://www.whoisentry.com, [URL names, looked at several which were probably misdetected as MRI]
|
---|
| 1440 | X http://loquevendra318.com, [uses Google translate for auto-translation]
|
---|
| 1441 |
|
---|
| 1442 |
|
---|
| 1443 | ?? http://www.forensicfashion.com, [historical information, useful for CULTURE? e.g. http://www.forensicfashion.com/1807MaoriChief.html]
|
---|
| 1444 |
|
---|
| 1445 | X http://www.eyecontactsite.com, [Lots of names. And a few short sentences or words possibly in comments.]
|
---|
| 1446 | X http://eartheum.com, [Rapa Nui, Easter Island related content. Misdetected]
|
---|
| 1447 | X http://www.steve-wheeler.co.uk, [Blogspot. Title of a single page is in Maori. "Aotearoa ... kei te aroha au ki a koe"]
|
---|
| 1448 | X https://chromium.googlesource.com, [some source code related to languages' two letter codes]
|
---|
| 1449 |
|
---|
| 1450 | X http://www.roadsmile.com, [Lots of misdetection based on word Kia.]
|
---|
| 1451 | ?? https://www.knowatom.com, https://phet.colorado.edu [Similar looking science web sites for children. Uses auto-translation?]
|
---|
| 1452 |
|
---|
| 1453 | X https://www.indexmundi.com, [place names. Pages about Solomon Islands. Misdetection of placenames.]
|
---|
| 1454 |
|
---|
| 1455 |
|
---|
| 1456 |
|
---|
| 1457 | X http://wowwars.net, [Has a page on Kia Kaha meaning, but URL redirects to a different low quality site with bad formatting and adverts. ]
|
---|
| 1458 | ?? https://www.hidroponia.org.mx, [Not sure if https://www.hidroponia.org.mx/index.php/idiomas/284-hydroponics-te-ahurea-wai-maori is autotranslated or not. Can't easily locate existence of Dutch or German translated pages. There's Tamil-Singapore, but no other Tamil. So maybe translations based on target buyer audience?]
|
---|
| 1459 | X http://www.v3whois.com, [URLs are misdetected as MRI]
|
---|
| 1460 | X http://rhymebrain.com, [appears to have misdetected a short phrase of 2 words, Kai Kaia, besides phrase words from other languages]
|
---|
| 1461 |
|
---|
| 1462 |
|
---|
| 1463 | X SINGLE SENTENCE DETECTED (NO MORE, AND NOT A WHOLE PAGE):
|
---|
| 1464 | http://frontrowphotos.com,
|
---|
| 1465 | http://www.pressreader.com,
|
---|
| 1466 | https://www.nccri.ie,
|
---|
| 1467 | http://takethatvacation.com,
|
---|
| 1468 | http://worldradiomap.com,
|
---|
| 1469 | http://www.namesdir.com,
|
---|
| 1470 |
|
---|
| 1471 | X http://www.frogsonline.com, [NZ hotels, placenames]
|
---|
| 1472 | X http://www.geni.com, [Single sentence misdetection]
|
---|
| 1473 | X http://wikiedit.org, [just a list of lots of words, possibly placenames. Some misdetected, e.g. Rapa Nui]
|
---|
| 1474 |
|
---|
| 1475 |
|
---|
[33823] | 1476 |
|
---|
| 1477 | ---------------
|
---|
[33849] | 1478 | All sites except NZ or .nz TLD where containingMRI=true manually inspected. Includes overseas sites with mi in URL path. All NZ sites passed through without inspection.
|
---|
[33823] | 1479 |
|
---|
| 1480 | MANUAL - TOTAL NUM SITES WITH SOME MRI CONTENT BY COUNTRY
|
---|
| 1481 | NZ: 176
|
---|
| 1482 | US: 25
|
---|
| 1483 | AU: 3
|
---|
| 1484 | FR: 1
|
---|
| 1485 | DK: 2
|
---|
| 1486 | (CA: 0.5)
|
---|
| 1487 | DE: 2
|
---|
| 1488 | IE (Ireland): 1
|
---|
| 1489 | CZ: 1
|
---|
| 1490 | ES: 1
|
---|
| 1491 | BG: 1
|
---|
| 1492 |
|
---|
| 1493 | TIDIED:
|
---|
| 1494 | NZ: 176
|
---|
[33847] | 1495 | US: 25+4 from US with mi in URL path = 29
|
---|
[33849] | 1496 | AU: 2
|
---|
[33823] | 1497 | DE: 2
|
---|
| 1498 | DK: 2
|
---|
| 1499 | BG: 1
|
---|
| 1500 | CZ: 1
|
---|
| 1501 | ES: 1
|
---|
| 1502 | FR: 1
|
---|
| 1503 | IE: 1
|
---|
[33849] | 1504 | TOTAL: 212+4 from US with mi in URL path = 216
|
---|
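The tidied total can be re-checked by summing the per-country figures. A small sketch in JavaScript, with the counts copied from the TIDIED list (the US figure includes the 4 extra sites with mi in the URL path):

```javascript
// Per-country counts copied from the TIDIED list.
const tidied = { NZ: 176, US: 25 + 4, AU: 2, DE: 2, DK: 2, BG: 1, CZ: 1, ES: 1, FR: 1, IE: 1 };

// Sum all values to get the overall number of sites with some MRI content.
const total = Object.values(tidied).reduce((sum, n) => sum + n, 0);
console.log(total); // 216
```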
[33823] | 1505 |
|
---|
| 1506 |
|
---|
[33838] | 1507 | ------------------------------
|
---|
| 1508 |
|
---|
| 1509 | Need to inspect all those URLs with mi in URL path (mi.* or */mi) that are not sites with nz TLD or originating in NZ:
|
---|
| 1510 |
|
---|
| 1511 | db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}, {domain: {$not: /\.nz$/}}, {geoLocationCountryCode: {$ne: "NZ"}}]}).count()
|
---|
| 1512 | 472
|
---|
| 1513 |
|
---|
| 1514 | (vs:
|
---|
| 1515 | db.getCollection('Websites').find({$and: [{numPagesInMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}, {domain: {$not: /\.nz$/}}, {geoLocationCountryCode: {$ne: "NZ"}}]}).count()
|
---|
| 1516 | 209)
|
---|
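A note on the domain test used in queries like these: the .nz pattern can be written in three ways that behave differently. With an unescaped dot, the dot matches any character. Without the $ anchor, the pattern can match in the middle of a domain. A quick sketch in plain JavaScript, using invented domain names chosen to show where the variants disagree:

```javascript
// Three ways of writing the ".nz" domain test.
const unescaped = /.nz$/;   // dot matches any character
const unanchored = /\.nz/;  // literal dot, but may match in the middle
const strict = /\.nz$/;     // literal ".nz" at the very end

// Invented domain names for illustration only.
const domains = ["example.co.nz", "www.nzexample.com", "example.benz", "example.com"];

for (const d of domains) {
  console.log(d, unescaped.test(d), unanchored.test(d), strict.test(d));
}
```

Only the strict form accepts example.co.nz while rejecting both www.nzexample.com and example.benz. Whether any real domain in the Websites collection is affected by the difference has not been checked here.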
| 1517 |
|
---|
| 1518 |
|
---|
| 1519 | db.Websites.aggregate([
|
---|
| 1520 | {
|
---|
| 1521 | $match: {
|
---|
| 1522 | $and: [{numPagesContainingMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}, {domain: {$not: /\.nz$/}}, {geoLocationCountryCode: {$ne: "NZ"}}]
|
---|
| 1523 | }
|
---|
| 1524 | },
|
---|
| 1525 | {$group: {_id: "$geoLocationCountryCode", count: {$sum: 1}, domain: { $addToSet: '$domain' }}},
|
---|
| 1526 | { $sort : { count : -1} }
|
---|
| 1527 | ])
|
---|
| 1528 |
|
---|
| 1529 |
|
---|
| 1530 | Of interest or possible interest:
|
---|
| 1531 | US:
|
---|
[33847] | 1532 | !! http://indigenousblogs.com [15/18 blogs work] - has one page in Maori (http://indigenousblogs.com/feeds/mi.xml)
|
---|
[33838] | 1533 | X https://biblia.gospelprime.com.br - misdetection (containsMRI)
|
---|
| 1534 | X ?https://follow3rs.com - seems dodgy and possibly auto-translated. Misspells "account" as "accout"
|
---|
| 1535 | !! https://mi.m.wikipedia.org, https://mi.wikipedia.org
|
---|
| 1536 | X https://usahello.org - autotranslated
|
---|
| 1537 | X http://church-of-christ.org, http://www.church-of-christ.org - I think autotranslated, because https://church-of-christ.org/nl/ has "HET kerken van Christus", with the singular article instead of the plural
|
---|
| 1538 | X https://www.livehoster.com
|
---|
| 1539 | X http://www.americasportsfloor.com, - product store. Misdetected
|
---|
| 1540 | !! http://csunplugged.org, https://www.csunplugged.org - University of Canterbury NZ and site only available in EN, MI, DE, ES, CN
|
---|
| 1541 | X https://mi.lawyers.cafe - autotranslated
|
---|
| 1542 | X https://mi.centr-zashity.ru - same as lawyers.cafe above: autotranslated
|
---|
| 1543 | ! https://policies.oclc.org - not completely translated. Copyright page, privacy statement and cookie statement pages appear to be in Maori. Not sure if autotranslated since other pages aren't available in MI. Dutch equivalent pages seem human translated.
|
---|
| 1544 | X http://jobdescriptionsample.org - autotranslated
|
---|
| 1545 | X http://mi.broadcastbeat.com - autotranslated product site
|
---|
| 1546 | X http://www.samewe.net - autotranslated product site
|
---|
| 1547 | X https://mi.kidspicturedictionary.com - autotranslated, but MAY BE USEFUL
|
---|
| 1548 | X https://www.rikoooo.com - autotranslated
|
---|
| 1549 |
|
---|
| 1550 | CN: -
|
---|
| 1551 |
|
---|
| 1552 | FR:
|
---|
| 1553 | ? https://mi.phcoker.com - product site "Shangke Chemical Rapu + 86 (1812) 4514114 [email protected]"
|
---|
| 1554 | X http://www.gpedia.com - dodgy copy of wikipedia, see http://www.gpedia.com/nl/gpedia/Hoofdpagina
|
---|
| 1555 |
|
---|
| 1556 | NL:
|
---|
| 1557 | X http://www.martinvrijland.nl - wordpress, autotranslated
|
---|
| 1558 |
|
---|
| 1559 | CA:
|
---|
| 1560 | X https://www.wikiplanet.click (seems like a dodgy copy of wikipedia)
|
---|
| 1561 | X cloudsfeed.com - wordpress admin page
|
---|
| 1562 |
|
---|
| 1563 |
|
---|
| 1564 | db.getCollection('Webpages').find({$and: [{isMRI: true}, {URL: /indigenousblogs\.com/}]})
|
---|
| 1565 | => http://indigenousblogs.com/mi/
|
---|
[33847] | 1566 |
|
---|
| 1567 | --------------------------
|
---|
| 1568 |
|
---|
| 1569 |
|
---|
| 1570 | db.Websites.aggregate([
|
---|
| 1571 | {
|
---|
| 1572 | $match: {
|
---|
| 1573 | $and: [
|
---|
| 1574 | {geoLocationCountryCode: {$ne: "NZ"}},
|
---|
| 1575 | {domain: {$not: /\.nz/}},
|
---|
| 1576 | {numPagesContainingMRI: {$gt: 0}},
|
---|
| 1577 | {$or: [{geoLocationCountryCode: "AU"}, {urlContainsLangCodeInPath: false}]}
|
---|
| 1578 | ]
|
---|
| 1579 | }
|
---|
| 1580 | },
|
---|
| 1581 | { $unwind: "$geoLocationCountryCode" },
|
---|
| 1582 | {
|
---|
| 1583 | $group: {
|
---|
| 1584 | _id: {$toLower: '$geoLocationCountryCode'},
|
---|
| 1585 | count: { $sum: 1 },
|
---|
| 1586 | domain: { $addToSet: '$domain' },
|
---|
| 1587 | numPagesInMRI: { $addToSet: '$numPagesInMRI' },
|
---|
| 1588 | numPagesContainingMRI: { $addToSet: '$numPagesContainingMRI' },
|
---|
| 1589 | numPagesInMRICount: { $sum: '$numPagesInMRI' },
|
---|
| 1590 | numPagesContainingMRICount: { $sum: '$numPagesContainingMRI' }
|
---|
| 1591 | }
|
---|
| 1592 | },
|
---|
| 1593 | { $sort : { count : -1} }
|
---|
| 1594 | ]);
|
---|
| 1595 |
|
---|
| 1596 |
|
---|
| 1597 | To convert json to csv
|
---|
| 1598 | In gedit replace
|
---|
| 1599 | \/\*\s*\d+\s*\*\/ => ,
|
---|