[33644] | 1 | MongoDB
|
---|
| 2 | Installation:
|
---|
| 3 | https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/
|
---|
| 4 | https://docs.mongodb.com/manual/administration/install-on-linux/
|
---|
| 5 | https://hevodata.com/blog/install-mongodb-on-ubuntu/
|
---|
| 6 | https://www.digitalocean.com/community/tutorials/how-to-install-mongodb-on-ubuntu-16-04
|
---|
| 7 | CENTOS (Analytics): https://tecadmin.net/install-mongodb-on-centos/
|
---|
| 8 | FROM SOURCE: https://github.com/mongodb/mongo/wiki/Build-Mongodb-From-Source
|
---|
| 9 | GUI:
|
---|
| 10 | https://robomongo.org/
|
---|
| 11 | Robomongo is Robo 3T now
|
---|
| 12 |
|
---|
| 13 | https://www.tutorialspoint.com/mongodb/mongodb_java.htm
|
---|
| 14 | JAR FILE:
|
---|
| 15 | http://central.maven.org/maven2/org/mongodb/mongo-java-driver/
|
---|
| 16 | https://mongodb.github.io/mongo-java-driver/
|
---|
| 17 |
|
---|
| 18 |
|
---|
| 19 |
|
---|
| 20 | https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/
|
---|
| 21 | http://www.programmersought.com/article/6500308940/
|
---|
| 22 |
|
---|
| 23 | 52 sudo apt-get install mongodb-clients
|
---|
| 24 | 53 mongo 'mongodb://mongodb.cms.waikato.ac.nz:27017' -u anupama -p
|
---|
| 25 |
|
---|
| 26 | Failed with
|
---|
| 27 | Error: HostAndPort: host is empty at src/mongo/shell/mongo.js:148
|
---|
| 28 | exception: connect failed
|
---|
| 29 |
|
---|
| 30 | This is due to a version incompatibility between the mongo client and the MongoDB server.
|
---|
| 31 | The solution is to follow instructions at http://www.programmersought.com/article/6500308940/
|
---|
| 32 | and then https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/
|
---|
| 33 | as below:
|
---|
| 34 |
|
---|
| 35 | 54 sudo apt-get purge mongodb-clients
|
---|
| 36 | 55 sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 9DA31620334BD75D9DCB49F368818C72E52529D4
|
---|
| 37 | 56 echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu xenial/mongodb-org/4.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.0.list
|
---|
| 38 | 57 sudo apt-get update
|
---|
| 39 | 58 sudo apt-get install mongodb-clients
|
---|
| 40 | 59 mongo 'mongodb://mongodb.cms.waikato.ac.nz:27017' -u anupama -p
|
---|
| 41 | (still doesn't work)
|
---|
| 42 | 60 sudo apt-get install -y mongodb-org
|
---|
| 43 | The above installs an up-to-date mongo client, but it also installs the MongoDB server. Possibly this is the only step needed to get both an up-to-date mongo client and the MongoDB server.
|
---|
| 44 | 72 sudo service mongod status
|
---|
| 45 |
|
---|
| 46 | 103 sudo service mongod start
|
---|
| 47 | "mongod" stands for "mongo daemon". It runs the MongoDB server, which listens for client connections.
|
---|
| 48 | 104 sudo service mongod status
|
---|
| 49 | 88 sudo service mongod stop
|
---|
| 50 |
|
---|
| 51 |
|
---|
| 52 | DETAILS:
|
---|
| 53 |
|
---|
| 54 | wharariki:[879]/Scratch/ak19/gs3-extensions/maori-lang-detection>mongo 'mongodb://mongodb.cms.waikato.ac.nz:27017' -u anupama -p
|
---|
| 55 |
|
---|
| 56 | This did not work with the password. It failed with:
|
---|
| 57 |
|
---|
| 58 | MongoDB shell version: 2.6.10
|
---|
| 59 | Enter password:
|
---|
| 60 | connecting to: mongodb://mongodb.cms.waikato.ac.nz:27017
|
---|
| 61 | 2019-11-04T20:02:47.970+1300 Assertion: 13110:HostAndPort: host is empty
|
---|
| 62 | 2019-11-04T20:02:47.970+1300 0x6b75c9 0x659e9f 0x636f69 0x4fa55c 0x501249 0x4fa7f1 0x6006fd 0x5eb869 0x7f7bfbd47d76 0x1f3c10d06362
|
---|
| 63 | mongo(_ZN5mongo15printStackTraceERSo+0x39) [0x6b75c9]
|
---|
| 64 | mongo(_ZN5mongo10logContextEPKc+0x21f) [0x659e9f]
|
---|
| 65 | mongo(_ZN5mongo11msgassertedEiPKc+0xd9) [0x636f69]
|
---|
| 66 | mongo(_ZN5mongo16ConnectionString12_fillServersENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x50c) [0x4fa55c]
|
---|
| 67 | mongo(_ZN5mongo16ConnectionStringC1ENS0_14ConnectionTypeERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES9_+0x99) [0x501249]
|
---|
| 68 | mongo(_ZN5mongo16ConnectionString5parseERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERS6_+0x201) [0x4fa7f1]
|
---|
| 69 | mongo(_ZN5mongo17mongoConsExternalEPNS_7V8ScopeERKN2v89ArgumentsE+0x11d) [0x6006fd]
|
---|
| 70 | mongo(_ZN5mongo7V8Scope10v8CallbackERKN2v89ArgumentsE+0xa9) [0x5eb869]
|
---|
| 71 | /usr/lib/libv8.so.3.14.5(+0x99d76) [0x7f7bfbd47d76]
|
---|
| 72 | [0x1f3c10d06362]
|
---|
| 73 | 2019-11-04T20:02:47.971+1300 Error: HostAndPort: host is empty at src/mongo/shell/mongo.js:148
|
---|
| 74 | exception: connect failed
|
---|
| 75 |
|
---|
| 76 |
|
---|
| 77 | This is due to a version incompatibility between the mongo client and the MongoDB server.
|
---|
| 78 | The client version is shown above (2.6.10).
|
---|
| 79 | The server version can be found from within the mongo client shell. To start the shell without loading a db:
|
---|
| 80 |
|
---|
| 81 |
|
---|
| 82 | wharariki:[880]/Scratch/ak19/gs3-extensions/maori-lang-detection>mongo --shell -nodb
|
---|
| 83 | MongoDB shell version: 2.6.10 <<<<<<<<<-------------------<<<< MONGO CLIENT VERSION
|
---|
| 84 | type "help" for help
|
---|
| 85 | > help
|
---|
| 86 | db.help() help on db methods
|
---|
| 87 | db.mycoll.help() help on collection methods
|
---|
| 88 | sh.help() sharding helpers
|
---|
| 89 | rs.help() replica set helpers
|
---|
| 90 | help admin administrative help
|
---|
| 91 | help connect connecting to a db help
|
---|
| 92 | help keys key shortcuts
|
---|
| 93 | help misc misc things to know
|
---|
| 94 | help mr mapreduce
|
---|
| 95 |
|
---|
| 96 | show dbs show database names
|
---|
| 97 | show collections show collections in current database
|
---|
| 98 | show users show users in current database
|
---|
| 99 | show profile show most recent system.profile entries with time >= 1ms
|
---|
| 100 | show logs show the accessible logger names
|
---|
| 101 | show log [name] prints out the last segment of log in memory, 'global' is default
|
---|
| 102 | use <db_name> set current database
|
---|
| 103 | db.foo.find() list objects in collection foo
|
---|
| 104 | db.foo.find( { a : 1 } ) list objects in foo where a == 1
|
---|
| 105 | it result of the last line evaluated; use to further iterate
|
---|
| 106 | DBQuery.shellBatchSize = x set default number of items to display on shell
|
---|
| 107 | exit quit the mongo shell
|
---|
| 108 |
|
---|
| 109 | > help connect
|
---|
| 110 |
|
---|
| 111 | Normally one specifies the server on the mongo shell command line. Run mongo --help to see those options.
|
---|
| 112 | Additional connections may be opened:
|
---|
| 113 |
|
---|
| 114 | var x = new Mongo('host[:port]');
|
---|
| 115 | var mydb = x.getDB('mydb');
|
---|
| 116 | or
|
---|
| 117 | var mydb = connect('host[:port]/mydb');
|
---|
| 118 |
|
---|
| 119 | Note: the REPL prompt only auto-reports getLastError() for the shell command line connection.
|
---|
| 120 |
|
---|
| 121 | Getting help on connect options:
|
---|
| 122 |
|
---|
| 123 | > var x = new Mongo('mongodb.cms.waikato.ac.nz:27017');
|
---|
| 124 | > var mydb = x.getDB('anupama');
|
---|
| 125 |
|
---|
| 126 | > mydb.connect.help()
|
---|
| 127 | DBCollection help
|
---|
| 128 | db.connect.find().help() - show DBCursor help
|
---|
| 129 | db.connect.count()
|
---|
| 130 | db.connect.copyTo(newColl) - duplicates collection by copying all documents to newColl; no indexes are copied.
|
---|
| 131 | db.connect.convertToCapped(maxBytes) - calls {convertToCapped:'connect', size:maxBytes}} command
|
---|
| 132 | db.connect.dataSize()
|
---|
| 133 | db.connect.distinct( key ) - e.g. db.connect.distinct( 'x' )
|
---|
| 134 | db.connect.drop() drop the collection
|
---|
| 135 | db.connect.dropIndex(index) - e.g. db.connect.dropIndex( "indexName" ) or db.connect.dropIndex( { "indexKey" : 1 } )
|
---|
| 136 | db.connect.dropIndexes()
|
---|
| 137 | db.connect.ensureIndex(keypattern[,options]) - options is an object with these possible fields: name, unique, dropDups
|
---|
| 138 | db.connect.reIndex()
|
---|
| 139 | db.connect.find([query],[fields]) - query is an optional query filter. fields is optional set of fields to return.
|
---|
| 140 | e.g. db.connect.find( {x:77} , {name:1, x:1} )
|
---|
| 141 | db.connect.find(...).count()
|
---|
| 142 | db.connect.find(...).limit(n)
|
---|
| 143 | db.connect.find(...).skip(n)
|
---|
| 144 | db.connect.find(...).sort(...)
|
---|
| 145 | db.connect.findOne([query])
|
---|
| 146 | db.connect.findAndModify( { update : ... , remove : bool [, query: {}, sort: {}, 'new': false] } )
|
---|
| 147 | db.connect.getDB() get DB object associated with collection
|
---|
| 148 | db.connect.getPlanCache() get query plan cache associated with collection
|
---|
| 149 | db.connect.getIndexes()
|
---|
| 150 | db.connect.group( { key : ..., initial: ..., reduce : ...[, cond: ...] } )
|
---|
| 151 | db.connect.insert(obj)
|
---|
| 152 | db.connect.mapReduce( mapFunction , reduceFunction , <optional params> )
|
---|
| 153 | db.connect.aggregate( [pipeline], <optional params> ) - performs an aggregation on a collection; returns a cursor
|
---|
| 154 | db.connect.remove(query)
|
---|
| 155 | db.connect.renameCollection( newName , <dropTarget> ) renames the collection.
|
---|
| 156 | db.connect.runCommand( name , <options> ) runs a db command with the given name where the first param is the collection name
|
---|
| 157 | db.connect.save(obj)
|
---|
| 158 | db.connect.stats()
|
---|
| 159 | db.connect.storageSize() - includes free space allocated to this collection
|
---|
| 160 | db.connect.totalIndexSize() - size in bytes of all the indexes
|
---|
| 161 | db.connect.totalSize() - storage allocated for all data and indexes
|
---|
| 162 | db.connect.update(query, object[, upsert_bool, multi_bool]) - instead of two flags, you can pass an object with fields: upsert, multi
|
---|
| 163 | db.connect.validate( <full> ) - SLOW
|
---|
| 164 | db.connect.getShardVersion() - only for use with sharding
|
---|
| 165 | db.connect.getShardDistribution() - prints statistics about data distribution in the cluster
|
---|
| 166 | db.connect.getSplitKeysForChunks( <maxChunkSize> ) - calculates split points over all chunks and returns splitter function
|
---|
| 167 | db.connect.getWriteConcern() - returns the write concern used for any operations on this collection, inherited from server/db if set
|
---|
| 168 | db.connect.setWriteConcern( <write concern doc> ) - sets the write concern for writes to the collection
|
---|
| 169 | db.connect.unsetWriteConcern( <write concern doc> ) - unsets the write concern for writes to the collection
|
---|
| 170 | > mydb.version()
|
---|
| 171 | 4.0.13 <<<<<<<<<-------------------<<<< MONGODB SERVER VERSION
|
---|
| 172 |
|
---|
| 173 | (Check Mongo server version: https://stackoverflow.com/questions/38160412/how-to-find-the-exact-version-of-installed-mongodb)
|
---|
| 174 |
|
---|
| 175 | Finally we now know the mongodb server version: 4.0.13
|
---|
| 176 | This version doesn't work with our mongo client (shell) version of 2.6.10.
|
---|
| 177 |
|
---|
| 178 |
|
---|
| 179 | DETAILS OF INSTALLING MONGO-CLIENT AND UPDATING IT, AND INSTALLING MONGODB SERVER:
|
---|
| 180 |
|
---|
| 181 |
|
---|
| 182 | 54 sudo apt-get purge mongodb-clients
|
---|
| 183 | 55 sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 9DA31620334BD75D9DCB49F368818C72E52529D4
|
---|
| 184 | 56 echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu xenial/mongodb-org/4.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.0.list
|
---|
| 185 | 57 sudo apt-get update
|
---|
| 186 | 58 sudo apt-get install mongodb-clients
|
---|
| 187 | 59 mongo 'mongodb://mongodb.cms.waikato.ac.nz:27017' -u anupama -p
|
---|
| 188 | 60 sudo apt-get install -y mongodb-org
|
---|
| 189 | 61 mongo 'mongodb://mongodb.cms.waikato.ac.nz:27017' -u anupama -p
|
---|
| 190 | 62 sudo service apache2 status
|
---|
| 191 | 63 sudo service sshd status
|
---|
| 192 | 64 sudo service mongodb status
|
---|
| 193 | 65 sudo service mongo status
|
---|
| 194 | 66 mongod
|
---|
| 195 | 67 mongod --help
|
---|
| 196 | 68 mongod --help | less
|
---|
| 197 | 69 mongod -f /etc/mongod.conf
|
---|
| 198 | 70 sudo mongod -f /etc/mongod.conf
|
---|
| 199 | 71 less /etc/mongod.conf
|
---|
| 200 | 72 sudo service mongod status
|
---|
| 201 | 73 sudo service mongod start
|
---|
| 202 | 74 sudo service mongod status
|
---|
| 203 | 75 ls -l /var/log/mongodb/mongod.log
|
---|
| 204 | 76 sudo rm /var/log/mongodb/mongod.log
|
---|
| 205 | 77 sudo service mongod status
|
---|
| 206 | 78 sudo service mongod start
|
---|
| 207 | 79 sudo service mongod status
|
---|
| 208 | 80 sudo service mongod stop
|
---|
| 209 | 81 ps auxww | grep mongo
|
---|
| 210 | 82 sudo service mongod start
|
---|
| 211 | 83 sudo service mongod status
|
---|
| 212 | 84 ps auxww | grep mongo
|
---|
| 213 | 85 sudo dmsg
|
---|
| 214 | 86 sudo dmesg
|
---|
| 215 | 87 sudo service mongod status
|
---|
| 216 | 88 sudo service mongod stop
|
---|
| 217 | 89 sudo service mongod start
|
---|
| 218 | 90 sudo dmesg
|
---|
| 219 | 91 sudo less /var/log/mongodb/mongod.log
|
---|
| 220 | 92 ls /var/lib/
|
---|
| 221 | 93 ls -ld /var/lib/
|
---|
| 222 | 94 ls -l /var/log/mongodb/mongod.log
|
---|
| 223 | 95 ls -ld /var/lib/
|
---|
| 224 | 96 groups mongodb
|
---|
| 225 | 97 less /etc/mongod.conf
|
---|
| 226 | 98 sudo less /var/log/mongodb/mongod.log
|
---|
| 227 | 99 less /etc/mongod.conf
|
---|
| 228 | 100 ls -l /var/lib/mongodb/
|
---|
| 229 | 101 sudo chown -R mongodb /var/lib/mongodb/
|
---|
| 230 | 102 sudo chgrp -R mongodb /var/lib/mongodb/
|
---|
| 231 | 103 sudo service mongod start
|
---|
| 232 | 104 sudo service mongod status
|
---|
| 233 | 105 history
|
---|
| 234 |
|
---|
| 235 |
|
---|
| 236 |
|
---|
| 237 | MONGO DB ROBO 3T
|
---|
| 238 | 1. Download "Double Pack" from https://robomongo.org/
|
---|
| 239 | 2. Untar its contents, then untar the tarball inside it.
|
---|
| 240 | 3. Run:
|
---|
| 241 | wharariki:[110]~/Downloads/robo3t-1.3.1-linux-x86_64-7419c406>./bin/robo3t
|
---|
| 242 |
|
---|
| 243 | ===================
|
---|
| 244 | On analytics, on the Vagrant VM node1, we've installed the MongoDB server and client.
|
---|
| 245 | We're able to create collections there successfully.
|
---|
| 246 |
|
---|
| 247 |
|
---|
| 248 | vagrant@node1:~$ mongo
|
---|
| 249 | MongoDB shell version v4.2.1
|
---|
| 250 | connecting to: mongodb://127.0.0.1:27017/?compressors=disabled&gssapiServiceName=mongodb
|
---|
| 251 | Implicit session: session { "id" : UUID("87bb585c-4685-47f6-bf89-a93801daeb2d") }
|
---|
| 252 | MongoDB server version: 4.2.1
|
---|
| 253 | Server has startup warnings:
|
---|
| 254 | 2019-11-04T07:48:14.197+0000 I STORAGE [initandlisten]
|
---|
| 255 | 2019-11-04T07:48:14.198+0000 I STORAGE [initandlisten] ** WARNING: Using the XFS filesystem is strongly recommended with the WiredTiger storage engine
|
---|
| 256 | 2019-11-04T07:48:14.198+0000 I STORAGE [initandlisten] ** See http://dochub.mongodb.org/core/prodnotes-filesystem
|
---|
| 257 | 2019-11-04T07:48:14.624+0000 I CONTROL [initandlisten]
|
---|
| 258 | 2019-11-04T07:48:14.624+0000 I CONTROL [initandlisten] ** WARNING: Access control is not enabled for the database.
|
---|
| 259 | 2019-11-04T07:48:14.624+0000 I CONTROL [initandlisten] ** Read and write access to data and configuration is unrestricted.
|
---|
| 260 | 2019-11-04T07:48:14.624+0000 I CONTROL [initandlisten]
|
---|
| 261 | ---
|
---|
| 262 | Enable MongoDB's free cloud-based monitoring service, which will then receive and display
|
---|
| 263 | metrics about your deployment (disk utilization, CPU, operation statistics, etc).
|
---|
| 264 |
|
---|
| 265 | The monitoring data will be available on a MongoDB website with a unique URL accessible to you
|
---|
| 266 | and anyone you share the URL with. MongoDB may use this information to make product
|
---|
| 267 | improvements and to suggest MongoDB products and deployment options to you.
|
---|
| 268 |
|
---|
| 269 | To enable free monitoring, run the following command: db.enableFreeMonitoring()
|
---|
| 270 | To permanently disable this reminder, run the following command: db.disableFreeMonitoring()
|
---|
| 271 | ---
|
---|
| 272 |
|
---|
| 273 | > show dbs
|
---|
| 274 | admin 0.000GB
|
---|
| 275 | config 0.000GB
|
---|
| 276 | local 0.000GB
|
---|
| 277 | > use db ateacrawldata
|
---|
| 278 | 2019-11-05T05:24:20.155+0000 E QUERY [js] Error: [db ateacrawldata] is not a valid database name :
|
---|
| 279 | Mongo.prototype.getDB@src/mongo/shell/mongo.js:51:12
|
---|
| 280 | getDatabase@src/mongo/shell/session.js:913:28
|
---|
| 281 | DB.prototype.getSiblingDB@src/mongo/shell/db.js:22:12
|
---|
| 282 | shellHelper.use@src/mongo/shell/utils.js:803:10
|
---|
| 283 | shellHelper@src/mongo/shell/utils.js:790:15
|
---|
| 284 | @(shellhelp2):1:1
|
---|
| 285 | > db.createCollection('webpages');
|
---|
| 286 | { "ok" : 1 }
|
---|
[33646] | 287 | > db.webpages.drop();
|
---|
[33644] | 288 | ... ^C
|
---|
| 289 |
|
---|
| 290 | > db.webpages.drop();
|
---|
| 291 | true
|
---|
| 292 | > use ateacrawldata
|
---|
| 293 | switched to db ateacrawldata
|
---|
| 294 | > db.createCollection('webpages');
|
---|
| 295 | { "ok" : 1 }
|
---|
| 296 | > show collections
|
---|
| 297 | webpages
|
---|
| 298 | > db.createCollection('websites');
|
---|
| 299 | { "ok" : 1 }
|
---|
| 300 | >
|
---|
| 301 |
|
---|
| 302 | ------------------------
|
---|
| 303 |
|
---|
| 304 | Ask Clint to rename "anupama" database to "ateacrawldata" database following the instructions at:
|
---|
| 305 | https://stackoverflow.com/questions/9201832/how-do-you-rename-a-mongodb-database
|
---|
| 306 | I don't have permissions to do this.
|
---|
| 307 | Nor do I have permissions to create Mongo collections within a new database that I create, like ateacrawldata.
|
---|
| 308 | I only seem to have rights to the "anupama" database.
|
---|
| 309 |
|
---|
| 310 |
|
---|
[33646] | 311 |
|
---|
| 312 | -----------------------
|
---|
| 313 | The Vagrant virtual machine node1 has MongoDB installed.
|
---|
[33646] | 314 |
|
---|
[33722] | 315 | After doing "vagrant up" on node1 to start node1:
|
---|
| 316 |
|
---|
| 317 | [anupama@analytics vagrant-hadoop-hive-spark]$ vagrant ssh
|
---|
| 318 | vagrant@node1:~$ mongo
|
---|
| 319 | MongoDB shell version v4.2.1
|
---|
| 320 | connecting to: mongodb://127.0.0.1:27017/?compressors=disabled&gssapiServiceName=mongodb
|
---|
| 321 | 2019-11-13T09:22:46.996+0000 E QUERY [js] Error: couldn't connect to server 127.0.0.1:27017, connection attempt failed: SocketException: Error connecting to 127.0.0.1:27017 :: caused by :: Connection refused :
|
---|
| 322 | connect@src/mongo/shell/mongo.js:341:17
|
---|
| 323 | @(connect):2:6
|
---|
| 324 | 2019-11-13T09:22:46.999+0000 F - [main] exception: connect failed
|
---|
| 325 | 2019-11-13T09:22:46.999+0000 E - [main] exiting with code 1
|
---|
| 326 | vagrant@node1:~$ sudo service mongod status
|
---|
| 327 | ● mongod.service - MongoDB Database Server
|
---|
| 328 | Loaded: loaded (/lib/systemd/system/mongod.service; disabled; vendor preset: enabled)
|
---|
| 329 | Active: inactive (dead)
|
---|
| 330 | Docs: https://docs.mongodb.org/manual
|
---|
| 331 | vagrant@node1:~$ sudo service mongod start
|
---|
| 332 | vagrant@node1:~$ sudo service mongod status
|
---|
| 333 | ● mongod.service - MongoDB Database Server
|
---|
| 334 | Loaded: loaded (/lib/systemd/system/mongod.service; disabled; vendor preset: enabled)
|
---|
| 335 | Active: active (running) since Wed 2019-11-13 09:24:07 UTC; 2s ago
|
---|
| 336 | Docs: https://docs.mongodb.org/manual
|
---|
| 337 | Main PID: 4383 (mongod)
|
---|
| 338 | Tasks: 32
|
---|
| 339 | Memory: 199.3M
|
---|
| 340 | CPU: 754ms
|
---|
| 341 | CGroup: /system.slice/mongod.service
|
---|
| 342 | └─4383 /usr/bin/mongod --config /etc/mongod.conf
|
---|
| 343 |
|
---|
| 344 | Nov 13 09:24:07 node1 systemd[1]: Started MongoDB Database Server.
|
---|
| 345 | vagrant@node1:~$
|
---|
| 346 |
|
---|
| 347 |
|
---|
| 348 | So now mongodb is running on node1 on localhost:27017.
|
---|
| 349 |
|
---|
| 350 | Next, in another x-term on analytics, ssh into the node1 Vagrant VM while port-forwarding node1's localhost:27017 to analytics' localhost:27017:
|
---|
| 351 | vagrant ssh -- -L 27017:localhost:27017
|
---|
| 352 |
|
---|
| 353 |
|
---|
| 354 |
|
---|
| 355 | Finally, in another x-term, port-forward from analytics' port 27017 to the current machine's port 27017:
|
---|
| 356 | ssh -L 27017:localhost:27017 analytics
|
---|
| 357 |
|
---|
| 358 |
|
---|
| 359 | Robo 3T running on the current machine can now connect to localhost:27017.
|
---|
| 360 |
|
---|
| 361 | Then, in a new x-term, the mongo client shell can be used to connect (by default to localhost:27017):
|
---|
| 362 |
|
---|
| 363 | wharariki:[122]/Scratch/ak19/GS309>mongo --shell
|
---|
| 364 | MongoDB shell version v4.0.13
|
---|
| 365 | connecting to: mongodb://127.0.0.1:27017/?gssapiServiceName=mongodb
|
---|
| 366 | ...
|
---|
| 367 | > show dbs
|
---|
| 368 | admin 0.000GB
|
---|
| 369 | ateacrawldata 1.532GB
|
---|
| 370 | config 0.000GB
|
---|
| 371 | local 0.000GB
|
---|
| 372 | > use ateacrawldata
|
---|
| 373 |
|
---|
| 374 | > show collections
|
---|
| 375 | Webpages
|
---|
| 376 | Websites
|
---|
| 377 | oldwebpages
|
---|
| 378 | oldwebsites
|
---|
| 379 | -------------------
|
---|
| 380 |
|
---|
| 381 | Country code to geolocation CSV file found by Dr Bainbridge:
|
---|
| 382 | https://developers.google.com/public-data/docs/canonical/countries_csv
|
---|
| 383 |
|
---|
| 384 | Import into mongodb with:
|
---|
| 385 | https://stackoverflow.com/questions/4686500/how-to-use-mongoimport-to-import-csv
|
---|
| 386 |
|
---|
| 387 |
|
---|
| 388 |
|
---|
| 389 | NOTE: mongoimport is a command-line utility, not a command to be run from within the mongo shell. See https://jira.mongodb.org/browse/DOCS-11072
|
---|
| 390 | This means: in an x-term, DON'T start the mongo shell/client first. Instead, run the following directly from the x-term to import the countrycodes.csv file:
|
---|
| 391 |
|
---|
| 392 |
|
---|
| 393 | mongoimport -d ateacrawldata -c countrylocations --type csv --file /Scratch/ak19/maori-lang-detection/MoreReading/countrycodes.csv --headerline
|
---|
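As a rough illustration of what --headerline does, the sketch below (plain JavaScript for Node.js, not part of mongoimport) turns CSV text into documents, using the first row as the field names. The helper name csvToDocuments, the sample rows and the column names are assumptions made for illustration; the real tool handles quoting and type conversion more thoroughly.

```javascript
// Minimal sketch: mimic `mongoimport --type csv --headerline` in memory.
// Assumes simple CSV without quoted commas; real mongoimport is more thorough.
function csvToDocuments(csvText) {
  const lines = csvText.split(/\r?\n/).filter((line) => line.trim() !== "");
  // The header line supplies the field names for every document.
  const fields = lines[0].split(",").map((name) => name.trim());
  return lines.slice(1).map((line) => {
    const values = line.split(",");
    const doc = {};
    fields.forEach((field, i) => {
      const raw = (values[i] || "").trim();
      // Numeric-looking values become numbers; everything else stays a string.
      doc[field] = raw !== "" && !Number.isNaN(Number(raw)) ? Number(raw) : raw;
    });
    return doc;
  });
}

// Hypothetical sample rows (the column names are an assumption).
const sampleCsv = [
  "country,latitude,longitude,name",
  "NZ,-40.900557,174.885971,New Zealand",
  "AU,-25.274398,133.775136,Australia",
].join("\n");

const countryDocs = csvToDocuments(sampleCsv);
console.log(countryDocs);
```

Each data row becomes one document in the target collection, which is why the header row must not be imported as data.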
| 394 |
|
---|
| 395 |
|
---|
| 396 | -------------------------
|
---|
| 397 |
|
---|
[33646] | 398 | MONGODB QUERIES:
|
---|
| 399 |
|
---|
| 400 | db.getCollection('webpages').find({"isMRI": true, "singleSentences.langCode": "mri"})
|
---|
| 401 | db.getCollection('webpages').find({"singleSentences": { $elemMatch: {"langCode":"mri"} } }, {"singleSentences.$": 1})
|
---|
| 402 | db.getCollection('Webpages').find({"isMRI": true, "singleSentences": { $elemMatch: {"langCode":"eng"} } }, {"singleSentences.$": 1}) [returns the first English sentence of each page that is overall in MRI]
|
---|
| 403 | db.getCollection('Webpages').find({"containsMRI": true, "singleSentences": { $elemMatch: {"langCode":"mri"} } }, {"singleSentences.$": 1}) [returns the first MRI sentence of each page that contains MRI sentences]
|
---|
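To make clear what the $elemMatch filter combined with the positional projection returns, here is a minimal in-memory sketch in plain JavaScript (Node.js, no MongoDB needed). The helper name findFirstSentenceInLang, the sample pages and the "sentence" sub-field are made up for illustration; only singleSentences and langCode come from the queries above.

```javascript
// In-memory sketch of:
//   find({ singleSentences: { $elemMatch: { langCode: X } } },
//        { "singleSentences.$": 1 })
function findFirstSentenceInLang(pages, langCode) {
  return pages
    // $elemMatch: keep pages with at least one sentence in the given language.
    .filter((page) => page.singleSentences.some((s) => s.langCode === langCode))
    // Positional projection: return only the first matching array element.
    .map((page) => ({
      _id: page._id,
      singleSentences: [page.singleSentences.find((s) => s.langCode === langCode)],
    }));
}

// Hypothetical sample data.
const samplePages = [
  { _id: 1, isMRI: true, singleSentences: [
    { langCode: "eng", sentence: "Hello." },
    { langCode: "mri", sentence: "Kia ora." },
    { langCode: "mri", sentence: "Tēnā koe." },
  ] },
  { _id: 2, isMRI: false, singleSentences: [{ langCode: "eng", sentence: "Hi." }] },
];

const firstMriSentences = findFirstSentenceInLang(samplePages, "mri");
console.log(JSON.stringify(firstMriSentences));
```

Note that only the first matching sentence per page comes back, even when a page has several sentences in that language.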
[33646] | 404 |
|
---|
| 405 |
|
---|
| 406 | READING
|
---|
| 407 |
|
---|
| 408 | mongodb java convert class
|
---|
| 409 | https://www.quora.com/What-are-the-ways-of-converting-a-Java-object-to-a-MongoDB-document-and-vice-versa
|
---|
| 410 | https://stackoverflow.com/questions/39320825/pojo-to-org-bson-document-and-vice-versa
|
---|
| 411 | X https://mongodb.github.io/morphia/
|
---|
| 412 | https://stackoverflow.com/questions/10170506/inserting-java-object-to-mongodb-collection-using-java
|
---|
| 413 | X https://www.google.com/search?q=morphia+example&oq=morphia+example&aqs=chrome.0.0l6.4223j0j9&sourceid=chrome&ie=UTF-8
|
---|
| 414 | https://www.baeldung.com/mongodb-morphia
|
---|
| 415 | X https://web.archive.org/web/20171117121335/http://mongodb.github.io/morphia/1.3/getting-started/
|
---|
| 416 | => https://morphia.dev/1.4/getting-started/quick-tour/
|
---|
| 417 | https://github.com/MorphiaOrg/morphia/tree/master/docs/reference
|
---|
| 418 |
|
---|
| 419 |
|
---|
| 420 | mongodb querying
|
---|
| 421 | https://docs.mongodb.com/manual/tutorial/query-embedded-documents/
|
---|
| 422 | https://docs.mongodb.com/manual/tutorial/query-arrays/
|
---|
| 423 | https://www.google.com/search?q=mongodb+find+subdocument&oq=mongodb+find+&aqs=chrome.0.69i59j69i57j0l4.7607j1j8&sourceid=chrome&ie=UTF-8
|
---|
| 424 | https://stackoverflow.com/questions/25586901/how-to-find-document-and-single-subdocument-matching-given-criterias-in-mongodb
|
---|
| 425 | https://stackoverflow.com/questions/21113543/mongodb-get-subdocument
|
---|
| 426 | https://stackoverflow.com/questions/36948856/find-subdocuments-in-mongo
|
---|
| 427 | https://docs.mongodb.com/v3.0/reference/operator/projection/positional/#proj._S_
|
---|
| 428 | https://www.google.com/search?q=mongodb+query+tutorial&oq=mongodb+query+tutorial&aqs=chrome..69i57j0l2j69i60l3.4719j0j7&sourceid=chrome&ie=UTF-8
|
---|
| 429 | https://blog.exploratory.io/an-introduction-to-mongodb-query-for-beginners-bd463319aa4c
|
---|
| 430 | https://docs.mongodb.com/manual/reference/method/db.collection.find/
|
---|
| 431 | https://docs.mongodb.com/manual/reference/method/db.collection.find/#find-projection
|
---|
[33698] | 432 | https://stackoverflow.com/questions/39641925/mongodb-aggregation-framework-to-get-frequencies-of-fields-values
|
---|
[33666] | 433 |
|
---|
[33698] | 434 | https://exploratory.io/note/kanaugust/0961813761939766
|
---|
| 435 | https://docs.mongodb.com/manual/tutorial/project-fields-from-query-results/
|
---|
| 436 | https://docs.mongodb.com/manual/aggregation/
|
---|
| 437 |
|
---|
| 438 |
|
---|
[33675] | 439 | Mongo Studio 3T documentation:
|
---|
| 440 | https://studio3t.com/download/ (also has uninstall information)
|
---|
| 441 | https://studio3t.com/download-thank-you/?OS=x64
|
---|
[33666] | 442 |
|
---|
[33675] | 443 | Google: MongoDB visualization
|
---|
| 444 | MongoDB visualization map
|
---|
| 445 | MongoDB Charts
|
---|
| 446 | (Open source visualisation tools)
|
---|
| 447 |
|
---|
| 448 | json map visualizer
|
---|
| 449 | geojson.tools
|
---|
[33666] | 450 | -------------------
|
---|
| 451 |
|
---|
| 452 | Some queries with results:
|
---|
| 453 |
|
---|
| 454 | # Num websites
|
---|
| 455 | db.getCollection('Websites').find({}).count()
|
---|
[33804] | 456 | 1445
|
---|
[33666] | 457 |
|
---|
| 458 | # Num webpages
|
---|
| 459 | db.getCollection('Webpages').find({}).count()
|
---|
[33675] | 460 | X75139
|
---|
| 461 | 117496
|
---|
[33666] | 462 |
|
---|
[33813] | 463 | # Find number of websites that have 1 or more pages detected as being in Maori (a positive numPagesInMRI)
|
---|
[33666] | 464 | db.getCollection('Websites').find({numPagesInMRI: { $gt: 0}}).count()
|
---|
| 465 | 361
|
---|
| 466 |
|
---|
[33804] | 467 | # Number of sites containing at least one sentence for which OpenNLP detected the best language = MRI
|
---|
| 468 | db.getCollection('Websites').find({numPagesContainingMRI: {$gt: 0}}).count()
|
---|
| 469 | 868
|
---|
| 470 |
|
---|
| 471 | # As expected, the union of the above two matches the numPagesContainingMRI count, since a site with pages in MRI also has pages containing MRI:
|
---|
| 472 | db.getCollection('Websites').find({ $or: [ { numPagesInMRI: { $gt: 0 } }, { numPagesContainingMRI: {$gt: 0} } ] } ).count()
|
---|
| 473 | 868
|
---|
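The reasoning behind the matching counts can be sketched in plain JavaScript. This assumes that a page which is overall in MRI also counts as containing MRI sentences, so numPagesInMRI > 0 implies numPagesContainingMRI > 0 for every site. The sample sites and the "name" field are made up.

```javascript
// Sketch: why the $or count equals the numPagesContainingMRI count.
// Assumption: sites with numPagesInMRI > 0 are a subset of the sites
// with numPagesContainingMRI > 0.
const exampleSites = [
  { name: "a", numPagesInMRI: 2, numPagesContainingMRI: 5 },
  { name: "b", numPagesInMRI: 0, numPagesContainingMRI: 1 },
  { name: "c", numPagesInMRI: 0, numPagesContainingMRI: 0 },
];

const inMRI = exampleSites.filter((s) => s.numPagesInMRI > 0);
const containingMRI = exampleSites.filter((s) => s.numPagesContainingMRI > 0);
// Equivalent of { $or: [ {numPagesInMRI: {$gt: 0}}, {numPagesContainingMRI: {$gt: 0}} ] }
const eitherOne = exampleSites.filter(
  (s) => s.numPagesInMRI > 0 || s.numPagesContainingMRI > 0
);

console.log(inMRI.length, containingMRI.length, eitherOne.length);
```

If the $or count ever came out larger than the numPagesContainingMRI count, that would point to sites where the two fields are inconsistent.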
| 474 |
|
---|
[33666] | 475 | # Find number of webpages that are deemed to be overall in MRI (pages where isMRI=true)
|
---|
| 476 | db.getCollection('Webpages').find({isMRI:true}).count()
|
---|
| 477 | X5224
|
---|
[33675] | 478 | X5215
|
---|
| 479 | db.getCollection('Webpages').find({isMRI:true}).count()
|
---|
| 480 | 7818
|
---|
[33666] | 481 |
|
---|
| 482 | # Number of pages that contain any number of MRI sentences
|
---|
| 483 | db.getCollection('Webpages').find({containsMRI: true}).count()
|
---|
[33675] | 484 | X12858
|
---|
| 485 | 20371
|
---|
[33666] | 486 |
|
---|
[33675] | 487 |
|
---|
[33666] | 488 | # Number of sites with URLs containing /mi(/)
|
---|
[33800] | 489 | db.getCollection('Websites').find({urlContainsLangCodeInPath:true}).count()
|
---|
[33813] | 490 | X 153
|
---|
| 491 | # Number of sites with URLs containing /mi(/) OR http(s)://mi.*
|
---|
| 492 | db.getCollection('Websites').find({urlContainsLangCodeInPath:true}).count()
|
---|
| 493 | 670
|
---|
[33666] | 494 |
|
---|
| 495 | # Number of websites outside NZ that contain /mi(/) in any of their sub-URLs
|
---|
[33800] | 496 | db.getCollection('Websites').find({urlContainsLangCodeInPath:true, geoLocationCountryCode: {$ne : "NZ"} }).count()
|
---|
[33813] | 497 | X 147
|
---|
| 498 | # Number of websites outside NZ that contain /mi(/) OR http(s)://mi.* in any of their sub-URLs
|
---|
| 499 | db.getCollection('Websites').find({urlContainsLangCodeInPath:true, geoLocationCountryCode: {$ne : "NZ"} }).count()
|
---|
| 500 | 656
|
---|
[33666] | 501 |
|
---|
[33813] | 502 | # 6 sites with URLs containing /mi(/) that are in NZ
|
---|
[33800] | 503 | db.getCollection('Websites').find({urlContainsLangCodeInPath:true, geoLocationCountryCode: "NZ"}).count()
|
---|
[33813] | 504 | X 6
|
---|
| 505 | # 14 sites with URLs containing /mi(/) OR http(s)://mi.* that are in NZ
|
---|
| 506 | 14
|
---|
[33666] | 507 |
|
---|
[33804] | 508 |
|
---|
[33666] | 509 | # sort websites that contain /mi(/) in path by geoLocationCountryCode
|
---|
| 510 | # https://www.quackit.com/mongodb/tutorial/mongodb_sort_query_results.cfm
|
---|
[33800] | 511 | db.getCollection('Websites').find({urlContainsLangCodeInPath:true}).sort({geoLocationCountryCode: 1})
|
---|
[33666] | 512 |
|
---|
[33675] | 513 | Actually, I want to sort by count. See https://docs.mongodb.com/manual/reference/operator/aggregation/sortByCount/
|
---|
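A possible shape for such a query, as a sketch only: the $sortByCount stage groups by an expression and orders the groups by descending count. The helper name buildCountryCountPipeline is hypothetical, and the pipeline has not been run against the Websites collection here.

```javascript
// Sketch: build a pipeline that counts sites per country code, most frequent first.
// $sortByCount groups by the given expression and sorts by descending count.
function buildCountryCountPipeline(matchFilter) {
  return [
    { $match: matchFilter },
    { $sortByCount: "$geoLocationCountryCode" },
  ];
}

const pipeline = buildCountryCountPipeline({ urlContainsLangCodeInPath: true });
console.log(JSON.stringify(pipeline, null, 2));
// In the mongo shell this would be run as: db.Websites.aggregate(pipeline)
```

Each result document has the form { _id: <country code>, count: <number of sites> }.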
[33666] | 514 |
|
---|
[33675] | 515 |
|
---|
[33698] | 516 | # PROJECTION:
|
---|
[33800] | 517 | db.getCollection('Websites').find({geoLocationCountryCode: {$ne:"nz"}}, {geoLocationCountryCode:1, urlContainsLangCodeInPath: 1})
|
---|
[33675] | 518 |
|
---|
[33698] | 519 | https://docs.mongodb.com/manual/aggregation/
|
---|
[33710] | 520 | EXAMPLE:
|
---|
[33698] | 521 | db.orders.aggregate([
|
---|
| 522 | { $match: { status: "A" } },
|
---|
| 523 | { $group: { _id: "$cust_id", total: { $sum: "$amount" } } }
|
---|
| 524 | ])
|
---|
| 525 |
|
---|
[33710] | 526 | X db.Websites.aggregate([{ $match:{urlContainsLangCodeInPath:true}}, $group: {geoLocationCountryCode:1, total: $count}])
|
---|
[33698] | 527 |
|
---|
[33710] | 528 |
|
---|
| 529 | X db.Websites.aggregate([
|
---|
| 530 | { $match:{urlContainsLangCodeInPath:true}},
|
---|
| 531 | {$group: {geoLocationCountryCode:1}}
|
---|
| 532 | ])
|
---|
| 533 |
|
---|
| 534 | WORKS (but adding an $unwind stage would get rid of the null group):
|
---|
| 535 | db.Websites.aggregate([
|
---|
| 536 | { $match:{urlContainsLangCodeInPath:true}},
|
---|
| 537 | {$group: {_id: "$geoLocationCountryCode", count: {$sum: 1}}},
|
---|
| 538 | { $sort : { count : -1} }
|
---|
| 539 | ])
|
---|
| 540 |
|
---|
| 541 |
|
---|
| 542 | # COUNT OF ALL GEOLOCATION COUNTRIES
|
---|
| 543 | #https://stackoverflow.com/questions/14924495/mongodb-count-num-of-distinct-values-per-field-key
|
---|
| 544 | # LIST
|
---|
| 545 | db.Websites.distinct('geoLocationCountryCode');
|
---|
| 546 |
|
---|
| 547 | # COUNT
|
---|
| 548 | db.Websites.distinct('geoLocationCountryCode').length;
|
---|
| 549 |
|
---|
| 550 | # A COUNT WITH QUERY - https://docs.mongodb.com/manual/reference/command/distinct/#dbcmd.distinct
|
---|
| 551 |
|
---|
| 552 | db.runCommand ( { distinct: "Websites", key: "geoLocationCountryCode", query: { "urlContainsLangCodeInPath": true} } );
|
---|
| 553 |
|
---|
| 554 | # DISTINCT WITH QUERY WITHOUT COUNT - https://docs.mongodb.com/manual/reference/method/db.collection.distinct/
|
---|
| 555 | db.Websites.distinct('geoLocationCountryCode', {"urlContainsLangCodeInPath": true});
|
---|
| 556 |
|
---|
| 557 | #SORTED - https://stackoverflow.com/questions/4759437/get-distinct-values-with-sorted-data
|
---|
| 558 | db.Websites.distinct('geoLocationCountryCode', {"urlContainsLangCodeInPath": true}).sort();
|
---|
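SKETCH of the distinct variants above (list, count, with query, sorted) in plain JavaScript. The sample documents and the helper name distinctValues are assumptions for illustration; the sketch simply skips documents that lack the field, which may not match the server's behaviour in every edge case.

```javascript
// Hypothetical sample documents.
const websites = [
  { geoLocationCountryCode: "US", urlContainsLangCodeInPath: true },
  { geoLocationCountryCode: "NZ", urlContainsLangCodeInPath: true },
  { geoLocationCountryCode: "NZ", urlContainsLangCodeInPath: false },
  { geoLocationCountryCode: "DE", urlContainsLangCodeInPath: false },
];

// Collects the unique values of one field among documents passing the query.
function distinctValues(docs, field, predicate = () => true) {
  const seen = new Set();
  for (const doc of docs) {
    if (predicate(doc) && doc[field] !== undefined) seen.add(doc[field]);
  }
  return [...seen];
}

const allCodes = distinctValues(websites, "geoLocationCountryCode");      // LIST
const codeCount = allCodes.length;                                        // COUNT
const filteredSorted = distinctValues(                                    // WITH QUERY, SORTED
  websites,
  "geoLocationCountryCode",
  (d) => d.urlContainsLangCodeInPath === true
).sort();
console.log(allCodes, codeCount, filteredSorted);
```

With the runCommand form, the list comes back inside the result document (as its values array), so the count there would be taken from that array's length.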
| 559 |
|
---|
| 560 |
|
---|
[33787] | 561 | # count of all sites for which the geolocation is UNKNOWN
|
---|
| 562 | db.getCollection('Websites').find({geoLocationCountryCode: {$eq:"UNKNOWN"}}).count()
|
---|
| 563 |
|
---|
| 564 |
|
---|
[33710] | 565 | # AGGREGATION QUERIES THAT WORK:
|
---|
| 566 | #https://stackoverflow.com/questions/14924495/mongodb-count-num-of-distinct-values-per-field-key
|
---|
| 567 |
|
---|
[33787] | 568 | WORKS:
|
---|
| 569 | // count of country codes for all sites
|
---|
[33710] | 570 | db.Websites.aggregate([
|
---|
[33787] | 571 |
|
---|
| 572 | { $unwind: "$geoLocationCountryCode" },
|
---|
[33710] | 573 | {
|
---|
[33787] | 574 | $group: {
|
---|
| 575 | _id: "$geoLocationCountryCode",
|
---|
| 576 | count: { $sum: 1 }
|
---|
| 577 | }
|
---|
| 578 | },
|
---|
| 579 | { $sort : { count : -1} }
|
---|
| 580 | ]);
|
---|
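WHY THE $unwind GETS RID OF "null": by default $unwind drops documents whose field is null, missing or an empty array, and treats a non-array value as a one-element array. A plain JavaScript sketch of that default behaviour (the function name unwind and the sample documents are made up; the preserveNullAndEmptyArrays option is not modelled):

```javascript
// Emulates the default $unwind behaviour on one field.
function unwind(docs, field) {
  const out = [];
  for (const doc of docs) {
    const value = doc[field];
    if (value === null || value === undefined) continue; // null/missing: dropped
    if (Array.isArray(value)) {
      // One output document per array element; an empty array yields none.
      for (const item of value) out.push({ ...doc, [field]: item });
    } else {
      out.push(doc); // scalar: passed through as if it were a single-element array
    }
  }
  return out;
}

// Hypothetical sample documents.
const unwound = unwind(
  [
    { geoLocationCountryCode: "NZ" },
    { geoLocationCountryCode: null },
    {},
    { geoLocationCountryCode: [] },
    { geoLocationCountryCode: ["AU", "GB"] },
  ],
  "geoLocationCountryCode"
);
console.log(unwound.map((d) => d.geoLocationCountryCode));
```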
| 581 |
|
---|
[33804] | 582 | // count of country codes for sites that have at least one page detected as MRI
|
---|
[33787] | 583 |
|
---|
[33804] | 584 | db.Websites.aggregate([
|
---|
| 585 | {
|
---|
| 586 | $match: {
|
---|
| 587 | numPagesInMRI: {$gt: 0}
|
---|
| 588 | }
|
---|
| 589 | },
|
---|
| 590 | { $unwind: "$geoLocationCountryCode" },
|
---|
| 591 | {
|
---|
| 592 | $group: {
|
---|
| 593 | _id: {$toLower: '$geoLocationCountryCode'},
|
---|
| 594 | count: { $sum: 1 }
|
---|
| 595 | }
|
---|
| 596 | },
|
---|
| 597 | { $sort : { count : -1} }
|
---|
| 598 | ]);
|
---|
| 599 |
|
---|
| 600 | // count of country codes for sites that have at least one page containing at least one sentence detected as MRI
|
---|
| 601 | db.Websites.aggregate([
|
---|
| 602 | {
|
---|
| 603 | $match: {
|
---|
| 604 | numPagesContainingMRI: {$gt: 0}
|
---|
| 605 | }
|
---|
| 606 | },
|
---|
| 607 | { $unwind: "$geoLocationCountryCode" },
|
---|
| 608 | {
|
---|
| 609 | $group: {
|
---|
| 610 | _id: {$toLower: '$geoLocationCountryCode'},
|
---|
| 611 | count: { $sum: 1 }
|
---|
| 612 | }
|
---|
| 613 | },
|
---|
| 614 | { $sort : { count : -1} }
|
---|
| 615 | ]);
|
---|
| 616 |
|
---|
| 617 |
|
---|
[33787] | 618 | WORKS:
|
---|
[33813] | 619 | // count of country codes for sites that have /mi(/) or http(s)://mi.* in URL path
|
---|
[33787] | 620 |
|
---|
| 621 | db.Websites.aggregate([
|
---|
| 622 | {
|
---|
[33710] | 623 | $match: {
|
---|
| 624 | urlContainsLangCodeInPath: true
|
---|
| 625 | }
|
---|
| 626 | },
|
---|
| 627 | { $unwind: "$geoLocationCountryCode" },
|
---|
| 628 | {
|
---|
| 629 | $group: {
|
---|
| 630 | _id: {$toLower: '$geoLocationCountryCode'},
|
---|
| 631 | count: { $sum: 1 }
|
---|
| 632 | }
|
---|
| 633 | },
|
---|
[33722] | 634 | { $sort : { count : -1} }
|
---|
[33710] | 635 | ]);
|
---|
| 636 |
|
---|
| 637 |
|
---|
| 638 | WORKS:
|
---|
| 639 | db.Websites.aggregate([
|
---|
| 640 | {
|
---|
| 641 | $match: {
|
---|
| 642 | geoLocationCountryCode: {$ne : "UNKNOWN"}
|
---|
| 643 | }
|
---|
| 644 | },
|
---|
| 645 | { $unwind: "$geoLocationCountryCode" },
|
---|
| 646 | {
|
---|
| 647 | $group: {
|
---|
| 648 | _id: "$geoLocationCountryCode",
|
---|
| 649 | count: { $sum: 1 }
|
---|
| 650 | }
|
---|
| 651 | },
|
---|
[33722] | 652 | { $sort : { count : -1} }
|
---|
[33710] | 653 | ]);
|
---|
| 654 |
|
---|
| 655 | WORKS:
|
---|
| 656 | db.Websites.aggregate([
|
---|
| 657 | {
|
---|
| 658 | $match: {
|
---|
| 659 | "urlContainsLangCodeInPath": true
|
---|
| 660 | }
|
---|
| 661 | },
|
---|
| 662 | { $unwind: "$geoLocationCountryCode" },
|
---|
| 663 | {
|
---|
| 664 | $group: {
|
---|
| 665 | _id: "$geoLocationCountryCode",
|
---|
| 666 | count: { $sum: 1 }
|
---|
| 667 | }
|
---|
| 668 | },
|
---|
[33722] | 669 | { $sort : { count : -1} }
|
---|
[33710] | 670 | ]);
|
---|
| 671 |
|
---|
| 672 |
|
---|
| 673 | KEEP ADDITIONAL FIELDS - https://stackoverflow.com/questions/16662405/mongo-group-query-how-to-keep-fields:
|
---|
| 674 | a. KEEPS ONLY FIRST DOMAIN URL FOR EACH COUNTED COUNTRY CODE:
|
---|
| 675 |
|
---|
| 676 | db.Websites.aggregate([
|
---|
| 677 | {
|
---|
| 678 | $match: {
|
---|
| 679 | "urlContainsLangCodeInPath": true
|
---|
| 680 | }
|
---|
| 681 | },
|
---|
| 682 | { $unwind: "$geoLocationCountryCode" },
|
---|
| 683 | {
|
---|
| 684 | $group: {
|
---|
| 685 | _id: "$geoLocationCountryCode", count: { $sum: 1 },
|
---|
| 686 | domain: {$first: '$domain'}
|
---|
| 687 | }
|
---|
| 688 | },
|
---|
| 689 | { $sort : { count : -1} }
|
---|
| 690 | ]);
|
---|
| 691 |
|
---|
| 692 | b. KEEP ALL DOMAIN URLS:
|
---|
| 693 | db.Websites.aggregate([
|
---|
| 694 | {
|
---|
| 695 | $match: {
|
---|
| 696 | "urlContainsLangCodeInPath": true
|
---|
| 697 | }
|
---|
| 698 | },
|
---|
| 699 | { $unwind: "$geoLocationCountryCode" },
|
---|
| 700 | {
|
---|
| 701 | $group: {
|
---|
| 702 | _id: "$geoLocationCountryCode",
|
---|
| 703 | count: { $sum: 1 },
|
---|
| 704 | domain: { $addToSet: '$domain' }
|
---|
| 705 | }
|
---|
| 706 | },
|
---|
| 707 | { $sort : { count : -1} }
|
---|
| 708 | ]);
|
---|
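SKETCH of the difference between variants a and b, in plain JavaScript: $first keeps one domain per group, while $addToSet collects every distinct domain. Because $addToSet removes duplicates, the array can be shorter than count. The sample documents and the output field names firstDomain and allDomains are made up for illustration.

```javascript
// Groups by country code, keeping both the first domain seen (like $first)
// and the set of all distinct domains (like $addToSet).
function groupDomains(docs) {
  const groups = new Map();
  for (const doc of docs) {
    const key = doc.geoLocationCountryCode;
    if (!groups.has(key)) {
      groups.set(key, { _id: key, count: 0, firstDomain: doc.domain, allDomains: new Set() });
    }
    const group = groups.get(key);
    group.count += 1;
    group.allDomains.add(doc.domain);
  }
  return [...groups.values()]
    .map((g) => ({ ...g, allDomains: [...g.allDomains] }))
    .sort((a, b) => b.count - a.count);
}

// Hypothetical sample documents; "a.example" appears twice on purpose.
const grouped = groupDomains([
  { geoLocationCountryCode: "NZ", domain: "a.example" },
  { geoLocationCountryCode: "NZ", domain: "b.example" },
  { geoLocationCountryCode: "NZ", domain: "a.example" },
  { geoLocationCountryCode: "US", domain: "c.example" },
]);
console.log(grouped);
```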
| 709 |
|
---|
| 710 |
|
---|
| 711 | # WANT TO GET THE ABOVE INTO WORLD MAP, use geojson.tools found by Dr Bainbridge
|
---|
| 712 | geojson.tools
|
---|
| 713 | USAGE: https://www.here.xyz/viewer-tool/
|
---|
| 714 |
|
---|
| 715 |
|
---|
[33698] | 716 | AIMS:
|
---|
[33675] | 717 | * Identify where Maori language is online.
|
---|
| 718 | * How can we identify high quality sites that would be good for a corpus.
|
---|
| 719 | (Related work for other languages to quantifiably answer that)
|
---|
| 720 |
|
---|
[33806] | 721 | data-preparation
|
---|
| 722 | docs
|
---|
[33698] | 723 |
|
---|
| 724 |
|
---|
[33806] | 725 | ------------------------------------------
|
---|
[33698] | 726 |
|
---|
[33806] | 727 | BUILDING TOWARDS NEW MONGODB QUERY: Counts by country code of TENTATIVE NON-PRODUCT SITES that are in Maori
|
---|
| 728 | ---
|
---|
[33698] | 729 |
|
---|
[33806] | 730 | # https://stackoverflow.com/questions/16902930/mongodb-aggregation-framework-match-or
|
---|
| 731 | # https://docs.mongodb.com/manual/reference/operator/query/and/
|
---|
[33710] | 732 |
|
---|
[33806] | 733 | # 1. all the websites which are from NZ:
|
---|
| 734 | db.getCollection('Websites').find({geoLocationCountryCode: "NZ"}).count()
|
---|
| 735 | 128
|
---|
[33710] | 736 |
|
---|
[33806] | 737 | # 2. all the websites that have /mi in URL path which are from NZ:
|
---|
 | 738 | db.getCollection('Websites').find({$and: [{urlContainsLangCodeInPath: true}, {geoLocationCountryCode: "NZ"}]}).count()
|
---|
| 739 | 6
|
---|
[33710] | 740 |
|
---|
[33806] | 741 | # 3. all the websites that don't have /mi in URLpath
|
---|
| 742 | db.getCollection('Websites').find({urlContainsLangCodeInPath: false}).count()
|
---|
| 743 | 1292
|
---|
| 744 |
|
---|
| 745 | # 4. all the websites that don't have /mi, or if they do are from NZ
|
---|
 | 746 | # (should be the sum of points 2 and 3 above)
|
---|
| 747 | db.getCollection('Websites').find({$or: [{urlContainsLangCodeInPath: false}, {$and: [{urlContainsLangCodeInPath: true}, {geoLocationCountryCode: "NZ"}]}]}).count()
|
---|
| 748 | 1298
|
---|
| 749 |
|
---|
 | 750 | # 5. All the websites that have at least 1 page detected as MRI AND either don't have /mi in URL path or if they do are from NZ
|
---|
| 751 | # These are the TENTATIVE NON-PRODUCT SITES
|
---|
| 752 | # Should be less than the point 4, but more than 1 to 3
|
---|
[33813] | 753 |
|
---|
[33806] | 754 | db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{$or: [{urlContainsLangCodeInPath: false}, {$and: [{urlContainsLangCodeInPath: true}, {geoLocationCountryCode: "NZ"}]}]}]}).count()
|
---|
[33813] | 755 | X 859
|
---|
[33806] | 756 |
|
---|
[33813] | 757 | Now with http(s)://mi.* also excluded, the above query returns a count of:
|
---|
| 758 | 389
|
---|
| 759 |
|
---|
| 760 |
|
---|
| 761 | BUT THIS IS THE CORRECT VERSION OF THE QUERY:
|
---|
| 762 | db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{$or: [{geoLocationCountryCode: "NZ"}, {urlContainsLangCodeInPath: false}]}]}).count()
|
---|
| 763 | 389
|
---|
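CHECK that the shorter query is logically the same as the longer one in point 5: "not-in-path OR (in-path AND NZ)" reduces to "NZ OR not-in-path" as long as urlContainsLangCodeInPath is always strictly true or false. A plain JavaScript truth-table sketch; the function names are made up, and undefined stands in for a document where the field is missing.

```javascript
// Point 5 as first written: MRI AND (path false OR (path true AND NZ)).
const original = (hasMri, inPath, isNz) =>
  hasMri && (inPath === false || (inPath === true && isNz));

// The shorter version: MRI AND (NZ OR path false).
const simplified = (hasMri, inPath, isNz) =>
  hasMri && (isNz || inPath === false);

// Lists every input combination on which the two versions disagree.
function mismatches(inPathValues) {
  const found = [];
  for (const hasMri of [true, false]) {
    for (const inPath of inPathValues) {
      for (const isNz of [true, false]) {
        if (original(hasMri, inPath, isNz) !== simplified(hasMri, inPath, isNz)) {
          found.push({ hasMri, inPath, isNz });
        }
      }
    }
  }
  return found;
}

const strictBoolean = mismatches([true, false]);
const withMissing = mismatches([true, false, undefined]);
console.log(strictBoolean.length, withMissing);
```

The two only differ for an NZ site with MRI content whose urlContainsLangCodeInPath field is absent, so identical counts suggest the field is set on all such documents.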
| 764 |
|
---|
| 765 |
|
---|
[33806] | 766 | # 6. Now do the counts by country code of the above, by pasting the query of point 5 as the $match clause (i.e. without the .count() suffix)
|
---|
| 767 | # Counts by country code of TENTATIVE NON-PRODUCT SITES that are in Maori
|
---|
| 768 | db.Websites.aggregate([
|
---|
| 769 | {
|
---|
| 770 | $match: {$and: [{numPagesContainingMRI: {$gt: 0}},{$or: [{urlContainsLangCodeInPath: false}, {$and: [{urlContainsLangCodeInPath: true}, {geoLocationCountryCode: "NZ"}]}]}]}
|
---|
| 771 | },
|
---|
| 772 | { $unwind: "$geoLocationCountryCode" },
|
---|
| 773 | {
|
---|
| 774 | $group: {
|
---|
| 775 | _id: {$toLower: '$geoLocationCountryCode'},
|
---|
| 776 | count: { $sum: 1 }
|
---|
| 777 | }
|
---|
| 778 | },
|
---|
| 779 | { $sort : { count : -1} }
|
---|
| 780 | ]);
|
---|
| 781 |
|
---|
| 782 | The result is very close to the same aggregate on just numPagesContainingMRI.
|
---|
| 783 |
|
---|
| 784 | That's because if you count those websites that contain /mi/ AND numPagesContainingMRI, they're very few:
|
---|
| 785 |
|
---|
| 786 | db.Websites.aggregate([
|
---|
| 787 | {
|
---|
| 788 | $match: {
|
---|
| 789 | $and: [{numPagesContainingMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}]
|
---|
| 790 | }
|
---|
| 791 | },
|
---|
| 792 | { $unwind: "$geoLocationCountryCode" },
|
---|
| 793 | {
|
---|
| 794 | $group: {
|
---|
| 795 | _id: {$toLower: '$geoLocationCountryCode'},
|
---|
| 796 | count: { $sum: 1 }
|
---|
| 797 | }
|
---|
| 798 | },
|
---|
| 799 | { $sort : { count : -1} }
|
---|
| 800 | ]);
|
---|
| 801 |
|
---|
| 802 |
|
---|
| 803 | _id count
|
---|
| 804 | us 4.0
|
---|
| 805 | nz 4.0
|
---|
| 806 | au 3.0
|
---|
| 807 | ru 1.0
|
---|
| 808 | de 1.0
|
---|
| 809 |
|
---|
| 810 | Total: 13 sites that have /mi/ and are detected as having MRI content,
|
---|
| 811 | db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}]}).count()
|
---|
| 812 | 13
|
---|
| 813 |
|
---|
 | 814 | Of these 13, the 4 from NZ were already included in steps 5 and 6. So the difference is only 9 sites that are MI.
|
---|
| 815 |
|
---|
| 816 |
|
---|
| 817 | Let's get a listing of the sites' domains - 3 whose country codes are NOT NZ have NZ TLD!
|
---|
[33710] | 818 | /* 1 */
|
---|
| 819 | {
|
---|
[33806] | 820 | "_id" : "nz",
|
---|
| 821 | "count" : 4.0,
|
---|
| 822 | "domain" : [
|
---|
| 823 | "http://firstworldwar.tki.org.nz",
|
---|
| 824 | "http://www.firstworldwar.tki.org.nz",
|
---|
| 825 | "https://admin.teara.govt.nz",
|
---|
| 826 | "http://community.nzdl.org"
|
---|
| 827 | ]
|
---|
| 828 | }
|
---|
| 829 |
|
---|
| 830 | /* 2 */
|
---|
| 831 | {
|
---|
| 832 | "_id" : "us",
|
---|
| 833 | "count" : 4.0,
|
---|
| 834 | "domain" : [
|
---|
| 835 | "https://sexualviolence.victimsinfo.govt.nz",
|
---|
| 836 | "https://follow3rs.com",
|
---|
| 837 | "http://www.church-of-christ.org",
|
---|
| 838 | "http://www.mytrickstips.com"
|
---|
| 839 | ]
|
---|
| 840 | }
|
---|
| 841 |
|
---|
| 842 | /* 3 */
|
---|
| 843 | {
|
---|
| 844 | "_id" : "au",
|
---|
| 845 | "count" : 3.0,
|
---|
| 846 | "domain" : [
|
---|
| 847 | "https://rapuatearatika.education.govt.nz",
|
---|
| 848 | "https://www.kiwiproperty.com",
|
---|
| 849 | "https://curriculumtool.education.govt.nz"
|
---|
| 850 | ]
|
---|
| 851 | }
|
---|
| 852 |
|
---|
| 853 | /* 4 */
|
---|
| 854 | {
|
---|
| 855 | "_id" : "ru",
|
---|
| 856 | "count" : 1.0,
|
---|
| 857 | "domain" : [
|
---|
| 858 | "http://www.treningmozga.com"
|
---|
| 859 | ]
|
---|
| 860 | }
|
---|
| 861 |
|
---|
| 862 | /* 5 */
|
---|
| 863 | {
|
---|
| 864 | "_id" : "de",
|
---|
| 865 | "count" : 1.0,
|
---|
| 866 | "domain" : [
|
---|
| 867 | "http://www.almancax.com" # Website to learn German, autotranslated
|
---|
| 868 | ]
|
---|
| 869 | }
|
---|
| 870 |
|
---|
| 871 |
|
---|
| 872 | But we're not catching a potentially large number of auto-translated sites, like
|
---|
| 873 | - https://www.gigalight.com/all-languages.html
|
---|
| 874 | - http://www.hzhinew.com/
|
---|
| 875 |
|
---|
[33807] | 876 | https://culturesconnection.com/manual-or-automatic-translation/
|
---|
| 877 | Manual Or Automatic Translation?
|
---|
[33806] | 878 |
|
---|
[33807] | 879 | Automatic translation continues to improve day by day. However, it is still unable to reach perfect levels of accuracy and lacks a natural feel. Will it ever replace human translation?
|
---|
| 880 |
|
---|
[33806] | 881 | --------------
|
---|
[33807] | 882 | Mr Bill Rogers' suggestions for beginnings of trying to sieve out the auto-translated sites:
|
---|
| 883 | - skip .com. .co.<tld>. But .co.nz is also used for non-commercial sites or sites that nevertheless have high quality Maori language content.
|
---|
| 884 | - change cut-off value of OpenNLP language prediction? But for sentences and overlapping sentences, we're not using the cut-off value, we're just checking the best predicted language regardless of confidence level for this.
|
---|
[33806] | 885 |
|
---|
[33807] | 886 | - Code for (a range of) loading of language options in auto-translated sites?
|
---|
[33806] | 887 |
|
---|
[33807] | 888 | ====================
|
---|
[33806] | 889 |
|
---|
[33807] | 890 | # https://stackoverflow.com/questions/20175122/how-can-i-use-not-like-operator-in-mongodb
|
---|
[33806] | 891 |
|
---|
[33807] | 892 | Info on the sites with Maori language content that are either from NZ or have .nz domain (TLD):
|
---|
[33806] | 893 |
|
---|
[33807] | 894 | db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}}, {$or:[{geoLocationCountryCode: "NZ"}, {domain: /.nz$/}]}]})
|
---|
| 895 |
|
---|
| 896 | db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}}, {$or:[{geoLocationCountryCode: "NZ"}, {domain: /.nz$/}]}]}).count()
|
---|
| 897 | 183
|
---|
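CAVEAT on the regex: in /.nz$/ the dot is unescaped, so it matches any character followed by "nz" at the end of the string, not only a literal ".nz". A plain JavaScript sketch of the difference against the escaped form /\.nz$/. The domain "http://example.benz" is a made-up example; whether any such domain exists in the collection is not known, so the counts above may well be unaffected.

```javascript
const loose = /.nz$/;   // any character, then "nz" at the end
const strict = /\.nz$/; // a literal dot, then "nz" at the end

// First and last entries appear in these notes; the middle one is hypothetical.
const candidates = [
  "https://admin.teara.govt.nz",
  "http://example.benz",
  "https://www.kiwiproperty.com",
];

const comparison = candidates.map((domain) => ({
  domain,
  loose: loose.test(domain),
  strict: strict.test(domain),
}));
console.log(comparison);
```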
[33806] | 898 |
|
---|
[33807] | 899 | Inverse: the sites detected as containing at least 1 Maori language sentence that are NOT from NZ NOR have .nz domain ending (TLD):
|
---|
| 900 | db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}}, {geoLocationCountryCode: {$ne: "NZ"}}, {domain: {$not: /.nz$/}}]}).count()
|
---|
| 901 | 685
|
---|
[33806] | 902 |
|
---|
[33807] | 903 | The above two figures correctly add up to a total of 868 sites, which is the number of sites detected as containing at least 1 sentence in MRI:
|
---|
| 904 | db.getCollection('Websites').find({numPagesContainingMRI: {$gt: 0}}).count()
|
---|
| 905 | 868
|
---|
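WHY THE FIGURES ADD UP: "NOT (from NZ OR .nz domain)" is the same as "NOT from NZ AND NOT .nz domain", so the two queries split the MRI-containing sites into two non-overlapping groups. A plain JavaScript sketch on made-up sample documents (the variable names are illustrative only):

```javascript
// Hypothetical sample documents.
const sites = [
  { domain: "https://one.example.nz", geoLocationCountryCode: "US", numPagesContainingMRI: 3 },
  { domain: "https://two.example.com", geoLocationCountryCode: "NZ", numPagesContainingMRI: 1 },
  { domain: "https://three.example.com", geoLocationCountryCode: "DE", numPagesContainingMRI: 2 },
  { domain: "https://four.example.com", geoLocationCountryCode: "FR", numPagesContainingMRI: 0 },
];

const hasMri = (s) => s.numPagesContainingMRI > 0;
const hasNzTld = (s) => /\.nz$/.test(s.domain);

// From NZ, or with a .nz domain ending.
const nzCount = sites.filter(
  (s) => hasMri(s) && (s.geoLocationCountryCode === "NZ" || hasNzTld(s))
).length;

// The inverse: neither from NZ nor with a .nz domain ending.
const otherCount = sites.filter(
  (s) => hasMri(s) && s.geoLocationCountryCode !== "NZ" && !hasNzTld(s)
).length;

const mriCount = sites.filter(hasMri).length;
console.log(nzCount, otherCount, mriCount);
```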
[33806] | 906 |
|
---|
[33807] | 907 | Without those with /mi in path:
|
---|
| 908 | db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}}, {geoLocationCountryCode: {$ne: "NZ"}}, {domain: {$not: /.nz$/}}, {urlContainsLangCodeInPath: false}]}).count()
|
---|
[33806] | 909 |
|
---|
[33807] | 910 | Now let's get a listing of all 685 sites to be manually inspected to determine whether they're auto-translated:
|
---|
[33806] | 911 |
|
---|
[33807] | 912 | /*
|
---|
| 913 | db.Websites.aggregate([
|
---|
| 914 | {
|
---|
| 915 | $match: {
|
---|
| 916 | $and: [{numPagesContainingMRI: {$gt: 0}}, {geoLocationCountryCode: {$ne: "NZ"}}, {domain: {$not: /.nz$/}}, {urlContainsLangCodeInPath: false}]
|
---|
| 917 | }
|
---|
| 918 | },
|
---|
| 919 | { $unwind: "$geoLocationCountryCode" },
|
---|
| 920 | {
|
---|
| 921 | $group: {
|
---|
| 922 | _id: {$toLower: '$geoLocationCountryCode'},
|
---|
| 923 | count: { $sum: 1 },
|
---|
| 924 | domain: { $addToSet: '$domain' }
|
---|
| 925 | }
|
---|
| 926 | },
|
---|
| 927 | { $sort : { count : -1} }
|
---|
| 928 | ]);
|
---|
[33710] | 929 | */
|
---|
[33807] | 930 | db.Websites.aggregate([
|
---|
| 931 | {
|
---|
| 932 | $match: {
|
---|
[33813] | 933 | $and: [{numPagesContainingMRI: {$gt: 0}}, {geoLocationCountryCode: {$ne: "NZ"}}, {domain: {$not: /.nz$/}}, {urlContainsLangCodeInPath: {$ne: true}}]
|
---|
[33807] | 934 | }
|
---|
| 935 | },
|
---|
| 936 | { $unwind: "$geoLocationCountryCode" },
|
---|
| 937 | {
|
---|
| 938 | $group: {
|
---|
| 939 | _id: {$toLower: '$geoLocationCountryCode'},
|
---|
| 940 | count: { $sum: 1 },
|
---|
| 941 | domain: { $addToSet: '$domain' }
|
---|
| 942 | }
|
---|
| 943 | },
|
---|
| 944 | { $sort : { count : -1} }
|
---|
| 945 | ]);
|
---|
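NOTE on {urlContainsLangCodeInPath: false} versus {urlContainsLangCodeInPath: {$ne: true}}: the equality form only matches documents where the field is present and false, whereas $ne: true also matches documents where the field is missing. A plain JavaScript sketch with made-up sample documents:

```javascript
// Hypothetical sample documents; the third one lacks the field entirely.
const pages = [
  { domain: "a.example", urlContainsLangCodeInPath: true },
  { domain: "b.example", urlContainsLangCodeInPath: false },
  { domain: "c.example" },
];

// Like {urlContainsLangCodeInPath: false}
const equalsFalse = pages.filter((d) => d.urlContainsLangCodeInPath === false);

// Like {urlContainsLangCodeInPath: {$ne: true}}
const notTrue = pages.filter((d) => d.urlContainsLangCodeInPath !== true);

console.log(equalsFalse.length, notTrue.length);
```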
[33710] | 946 |
|
---|
[33813] | 947 |
|
---|
 | 948 | We can knock off another 54 non-NZ sites with our new urlContainsLangCodeInPathPrefix field:
|
---|
| 949 |
|
---|
| 950 | db.getCollection('Websites').find({urlContainsLangCodeInPathPrefix: true, geoLocationCountryCode: {$ne: "NZ"}, domain: {$not: /.nz$/}}).count()
|
---|
| 951 | 54
|
---|
| 952 |
|
---|
| 953 |
|
---|
| 954 | SO, can repeat query with new field "urlContainsLangCodeInPathPrefix":
|
---|
| 955 | Number of sites containing >= 1 MRI sentences that are not from NZ or of .nz TLD and which don't contain "/mi(/)" or "http(s)://mi." in URL path:
|
---|
| 956 | db.getCollection('Websites').find({$and: [
|
---|
| 957 | {numPagesContainingMRI: {$gt: 0}},
|
---|
| 958 | {geoLocationCountryCode: {$ne: "NZ"}},
|
---|
| 959 | {domain: {$not: /.nz$/}},
|
---|
| 960 | {urlContainsLangCodeInPathSuffix: {$ne: true}},
|
---|
| 961 | {urlContainsLangCodeInPathPrefix: {$ne: true}}
|
---|
| 962 | ]}).count()
|
---|
| 963 |
|
---|
| 964 | 651
|
---|
| 965 |
|
---|
| 966 |
|
---|
| 967 | REDO THE COUNT BY COUNTRY QUERY FOR THIS:
|
---|
| 968 |
|
---|
| 969 | db.Websites.aggregate([
|
---|
| 970 | {
|
---|
| 971 | $match: {
|
---|
| 972 | $and: [{numPagesContainingMRI: {$gt: 0}}, {geoLocationCountryCode: {$ne: "NZ"}}, {domain: {$not: /.nz$/}}, {urlContainsLangCodeInPathSuffix: {$ne: true}}, {urlContainsLangCodeInPathPrefix: {$ne: true}}]
|
---|
| 973 | }
|
---|
| 974 | },
|
---|
| 975 | { $unwind: "$geoLocationCountryCode" },
|
---|
| 976 | {
|
---|
| 977 | $group: {
|
---|
| 978 | _id: {$toLower: '$geoLocationCountryCode'},
|
---|
| 979 | count: { $sum: 1 },
|
---|
| 980 | domain: { $addToSet: '$domain' }
|
---|
| 981 | }
|
---|
| 982 | },
|
---|
| 983 | { $sort : { count : -1} }
|
---|
| 984 | ]);
|
---|
| 985 |
|
---|
| 986 |
|
---|
 | 987 | AFTER BUGFIX FOR miInURLPath, which is now set correctly:
|
---|
| 988 | db.getCollection('Websites').find(
|
---|
| 989 | {$and: [
|
---|
| 990 | {numPagesContainingMRI: {$gt: 0}},
|
---|
| 991 | {geoLocationCountryCode: {$ne: "NZ"}},
|
---|
| 992 | {domain: {$not: /.nz$/}},
|
---|
| 993 | {urlContainsLangCodeInPath: {$ne: true}}
|
---|
| 994 | ]}).count()
|
---|
| 995 |
|
---|
| 996 | 220
|
---|
| 997 |
|
---|
| 998 | db.Websites.aggregate([
|
---|
| 999 | {
|
---|
| 1000 | $match: {
|
---|
| 1001 | $and: [
|
---|
| 1002 | {numPagesContainingMRI: {$gt: 0}},
|
---|
| 1003 | {geoLocationCountryCode: {$ne: "NZ"}},
|
---|
| 1004 | {domain: {$not: /.nz$/}},
|
---|
| 1005 | {urlContainsLangCodeInPath: {$ne: true}}
|
---|
| 1006 | ]
|
---|
| 1007 | }
|
---|
| 1008 | },
|
---|
| 1009 | { $unwind: "$geoLocationCountryCode" },
|
---|
| 1010 | {
|
---|
| 1011 | $group: {
|
---|
| 1012 | _id: {$toLower: '$geoLocationCountryCode'},
|
---|
| 1013 | count: { $sum: 1 },
|
---|
| 1014 | domain: { $addToSet: '$domain' }
|
---|
| 1015 | }
|
---|
| 1016 | },
|
---|
| 1017 | { $sort : { count : -1} }
|
---|
| 1018 | ]);
|
---|
| 1019 |
|
---|
 | 1020 | Can inspect a website's pages to check whether they're relevant/auto-translated as follows:
|
---|
| 1021 | db.getCollection('Webpages').find({URL:/svenkirsten.com/, mriSentenceCount: {$gt: 0}})
|
---|
| 1022 |
|
---|
| 1023 |
|
---|
[33807] | 1024 | * CN: Only 1/113 sites from CN stood out as being of interest: http://kiwi2china.com/
|
---|
| 1025 | BUT: it's auto-translated (e.g. Dutch is clearly auto-translated), MRI not in default or any visible drop down list, and the domain changes once you view it in Dutch to https://nl.admission.nz/
|
---|
[33710] | 1026 |
|
---|
[33816] | 1027 | * FR: 16 sites from FR
|
---|
| 1028 | http://blueheavenisland.com, http://www.blueheavenisland.com - misdetection. French Polynesia
|
---|
[33807] | 1029 | https://www.lexilogos.com/ -> takes me to NZ website MaoriDictionary.co.nz etc for translating words anyway
|
---|
| 1030 | http://kihikihi.fr/ -> travel (blog?). Appears to be Hawaiian related and not Maori.
|
---|
| 1031 | !! http://chantsdeluttes.free.fr/versionsinter/page%20maori.html -> Seems it may be a proper translation or composition, as Dutch and Flemish (and Groningense) versions are different songs by individual translators/composers
|
---|
| 1032 | http://splaf.free.fr/pfurb.html - Tahiti, French Polynesian, ... island names
|
---|
[33816] | 1033 | X http://mi.fitnessrebates.com - Uses https://wordpress.org/plugins/weglot/ wordpress-compatible multilingual plugin, which ensures translated pages get indexed by google - exactly what we want to avoid
|
---|
| 1034 | http://mahajana.net - misdetected a Japanese Zen Buddhist chant as MRI
|
---|
| 1035 | http://rapanui.fr - Rapa Nui Easter Island. Misdetected.
|
---|
| 1036 | http://www.gif.ovh - autotranslated pages. Supposedly a GIF repository
|
---|
| 1037 | http://baladeornithologique.com - misdetection of the word "Retour"
|
---|
| 1038 | http://www.gaudry.be - misdetection of Japanese hiragana etc, and French "faire", as MRI
|
---|
| 1039 | http://www.gototahiti.net - probably misdetection, see title
|
---|
| 1040 | http://www.maraamusurfskirace.com - Bora Bora, French Polynesia. Misdetected.
|
---|
| 1041 | http://www.rongo-rongo.com - appears to be related to Easter Island. Just 1 sentence however.
|
---|
| 1042 | http://pt.city-usa.net - misdetection. Hawaii.
|
---|
| 1043 | https://www.manualscat.com - Misdetection. Appears to be in German. Manuals pages.
|
---|
| 1044 | NL:
|
---|
[33823] | 1045 | (!!!) - http://www.gouvernante.info and http://gouvernante.info - radio links to NZ websites not found by commoncrawl and which potentially have Maori language content. For example, http://irirangi.net/, https://www.atiawatoafm.com, www.maori.org.nz [http://www.gouvernante.info/radio4.htm]
|
---|
[33816] | 1046 | - https://www.arrowhead.eu, https://arrowheadproject.azurewebsites.net, arrowhead.eu - misidentification of URL
|
---|
 | 1047 | - tonhut.nl - misidentification
|
---|
 | 1048 | ? http://nielsonboutique.co.uk, http://longhornlaw.net, http://tetsubo.org, http://hidsonphoto.com, http://wearehomework.com/ - Feels autotranslated, but no language options visible. All SEO related
|
---|
| 1049 | - diverosa.com - Rapa Nui, Easter Island
|
---|
| 1050 | - nonlinear.demon.nl - misidentified
|
---|
| 1051 | - encyclo.co.uk - misidentification
|
---|
| 1052 | - henrifloor.nl - misidentification
|
---|
| 1053 | - http://skimap.info/ - maps, NZ placenames in PDF
|
---|
| 1054 | DK:
|
---|
[33823] | 1055 | !! ++ http://akona.ngapuhitelevision.com, http://waiatarangatiratanga.ngapuhitelevision.com,
|
---|
[33816] | 1056 | http://jazz.ngapuhitelevision.com, http://ngapuhitelevision.com, http://ngapuhiradio.com,
|
---|
| 1057 | http://powhiri.ngapuhitelevision.com, http://komisch.ngapuhitelevision.com
|
---|
| 1058 | - http://www.rennertweb.de - a photogallery page mentioning NZ placenames
|
---|
| 1059 | CA:
|
---|
| 1060 | - http://bcmarina.com AND http://bckayak.com - photos with Canadian placenames
|
---|
 | 1061 | - http://www.myrasplace.net - pages of photos, captions involving NZ placenames
|
---|
| 1062 | ~ http://00.gs/Maniapoto;Uriwera;Moriori;Hivaoa;Kumulipo.htm - Maori-Polynesian comparative dictionary words listing
|
---|
| 1063 | - aguadilla.airport-authority.com - misidentification
|
---|
| 1064 | - https://articles.imperialtometric.com - misidentification
|
---|
| 1065 | - http://daandehn.com - no more than 1 sentence over multiple files. Appears to be photo captions of NZ placenames
|
---|
[33813] | 1066 | DE:
|
---|
[33816] | 1067 | - http://etymologie.info/~e/n_/nz-___reg.html - placenames, not meaningful
|
---|
 | 1068 | !! https://www.cartogiraffe.com/ and https://www.cartogiraffe.com - some genuine pages (Rarotongan), but one page is in Czech that had a single word misidentified as MRI
|
---|
[33813] | 1069 | ~ http://svenkirsten.com/ - one page mentions "tiki" but the rest is in English. The other is an (English) caption of "Book of Tiki A Maori Maiden"
|
---|
| 1070 | - herocity - autotranslated
|
---|
| 1071 | - weltderberge.de - 3 pages mention NZ mountains by name.
|
---|
| 1072 | ~ (arts.mythologica.fr) https://mythologica.fr/oceanie/texte/pantheon_polynesien.pdf - mentions certain Maori Gods and other Polynesian Gods by name.
|
---|
| 1073 | - https://traynews.com - nothing in MRI, misdetected
|
---|
| 1074 | ~ http://klaaskoehne.de/galleries/nzl-taranaki/index.html - mentions NZ mountain names
|
---|
| 1075 | - http://www.nierstrasz.org/deGrauwRegister.rtf - misdetected European (Dutch) names as MRI
|
---|
[33816] | 1076 | X https://afrikhepri.org/mi/ - autotranslated
|
---|
[33813] | 1077 | - https://www.tvteile.de - pure German pages, misdetected "Automatik" as a Maori language word
|
---|
| 1078 | - etoile-de-lune.net - 5 pages containing 1 sentence each but none with 2 sentences detected
|
---|
[33816] | 1079 | - https://www.you-fly.com - misdetection of German "Warum?" as MRI
|
---|
| 1080 | - http://vulkane.ch - misdetected pages on Hawaiian volcanoes.
|
---|
| 1081 | - http://www.stephe.de - photos from NZ captioned with NZ placenames
|
---|
| 1082 | - http://insecta.pro - misdetection
|
---|
| 1083 | - http://m.distanta.1km.net - NZ placenames. Lots of distances mentioning Waitangi. Nothing detected as containing more than 1 sentence.
|
---|
| 1084 | - https://ersatzteile-fachversand.de - German misdetected as Maori.
|
---|
| 1085 | - https://laskar02cinta.page.tl/Info.htm - seems like a junk site with a random sentence autotranslated into many different languages. So one sentence possibly in Maori, but may not make sense.
|
---|
| 1086 | - http://www.behlig.de - misdetection. Photos from Hawaii.
|
---|
| 1087 | !! http://www.udhr.de - Universal Declaration of Human Rights. (Also on a Bulgarian site). Multiple translations available.
|
---|
[33813] | 1088 | - ITALY:
|
---|
| 1089 | http://oipaz.net/IMG/GalleriaAotearoa/ - NZ photogallery with each photo captioned by placename
|
---|
| 1090 | http://www.marcosanti.it/Reportage/Oceania_ph/Nuova_Zelanda/ - each photo captioned by NZ placename
|
---|
| 1091 | http://www.pegasoesmicamion.com/ - REO abbreviation misidentified, also in REO%20PUBLICIDAD.htm
|
---|
| 1092 | - AUSTRIA:
|
---|
| 1093 | petit-prince.at - Tahitian and Wayuu (Venezuela) translations of Le Petit Prince
|
---|
| 1094 | http://www.tmtmm.net/newzealand - photos from NZ named after places and people's names
|
---|
| 1095 | - ROMANIA: parohiauceadesus.ro - Sentences of single Romanian words misidentified.
|
---|
| 1096 | - ISRAEL:
|
---|
| 1097 | http://www.daat.ac.il - misidentification of "no." as MRI, and Hebrew words.
|
---|
 | 1098 | https://www.hitiaotera.com/ - misidentification of Tahitian pages
|
---|
| 1099 | - RUSSIA: https://www.gismeteo.lv - misidentification of an email address
|
---|
| 1100 | - JAPAN: http://yutaka.it-n.jp - many pages of scientific names of (plants?) which are often misdetected as MRI
|
---|
[33816] | 1101 | !! - Ireland, ie: https://coggle.it
|
---|
[33813] | 1102 | - IRAN: https://www.dideo.ir/v/yt/d6cgya0ze-E - video title from MaoriTelevision website
|
---|
[33816] | 1103 | - CZECH republic:
|
---|
| 1104 | ? https://www.fipojobs.com/new-zealand/jobs-work-p-1 - NZ job position title in MRI but rest in English
|
---|
| 1105 | !! http://www.henryklahola.nazory.cz/094.Maori.htm and http://henryklahola.nazory.cz variant
|
---|
| 1106 | http://about.ilikeyou.com - dating site. Misidentification.
|
---|
| 1107 | - SPAIN:
|
---|
| 1108 | !! https://www.uv.es/~pla/red.net/intmaori.html
|
---|
| 1109 | https://www.reclamaciondevuelos.com - 2 occurrences of the word "kiwi"
|
---|
| 1110 | http://www.info-hoteles.com/nz/2/hotels_lake_rotoiti.asp - 2 uses of the same placename
|
---|
| 1111 | http://www.cruceros-princess.mx/princessMX/Oferta_Cruzeiros_Polinesia.html - Polynesian placenames
|
---|
[33813] | 1112 | - SINGAPORE: https://omg-solutions.com - autotranslated
|
---|
| 1113 | - TURKEY: https://www.elitedeluxe.com.tr/mi/yatak-odasi-takimlari - autotranslated
|
---|
| 1114 | - MEXICO: http://www.gelbukh.com - misidentification, lines of just numbers or phrases like "Area Chair" in English and Russian CVs.
|
---|
| 1115 | - FINLAND: http://pertti.com - travelogue, placenames
|
---|
| 1116 | - SWITZERLAND CH:
|
---|
| 1117 | nicoledidi.ch - blog, placenames
|
---|
| 1118 | https://photos.axelebert.org - Tahiti related content
|
---|
| 1119 | - UNKNOWN: https://www.viveipcl.com: tours website, placenames mentioned
|
---|
| 1120 | #- EU: https://www.the-good-stuff-factory.be/mi/ : Autotranslated
|
---|
| 1121 | !! - BULGARIA: http://anitra.net/activism/humanrights/UDHR/rrt_print.htm (2 pages)
|
---|
| 1122 |
|
---|
| 1123 |
|
---|
| 1124 | TREATING AUSTRALIA AND GREAT BRITAIN MORE SPECIALLY (don't ignore /mi in URL, same as with NZ, but do leave out .nz TLDs since we cover them under NZ - TODO: later find country codes of all .nz TLDs):
|
---|
| 1125 | [nothing found under "UK", only under "GB"]
|
---|
| 1126 |
|
---|
| 1127 | db.getCollection('Websites').find({
|
---|
| 1128 | domain: {$not: /.nz$/},
|
---|
| 1129 | numPagesContainingMRI: {$gt: 0},
|
---|
| 1130 | $or: [{geoLocationCountryCode: "AU"}, {geoLocationCountryCode: "GB"}]
|
---|
| 1131 | }).count()
|
---|
| 1132 | 11
|
---|
| 1133 |
|
---|
| 1134 | db.Websites.aggregate([
|
---|
| 1135 | {
|
---|
| 1136 | $match: {
|
---|
| 1137 | domain: {$not: /.nz$/},
|
---|
| 1138 | numPagesContainingMRI: {$gt: 0},
|
---|
| 1139 | $or: [{geoLocationCountryCode: "AU"}, {geoLocationCountryCode: "GB"}]
|
---|
| 1140 | }
|
---|
| 1141 | },
|
---|
| 1142 | { $unwind: "$geoLocationCountryCode" },
|
---|
| 1143 | {
|
---|
| 1144 | $group: {
|
---|
| 1145 | _id: {$toLower: '$geoLocationCountryCode'},
|
---|
| 1146 | count: { $sum: 1 },
|
---|
| 1147 | domain: { $addToSet: '$domain' }
|
---|
| 1148 | }
|
---|
| 1149 | },
|
---|
| 1150 | { $sort : { count : -1} }
|
---|
| 1151 | ]);
|
---|
| 1152 |
|
---|
| 1153 | AUSTRALIA:
|
---|
| 1154 | !! https://www.kiwiproperty.com - e.g. https://www.kiwiproperty.com/the-base/mi/he-paepaki/ has some actual MRI sentences. [Not autotranslated]
|
---|
| 1155 | ? http://fionajack.net - Wellington gallery of artist. A few occurrences of Kia Ora in a title like context (e.g. "Street Party Kia Ora! Kia Ora!")
|
---|
| 1156 | !! https://infogram.com/te-marautanga-o-aotearoa-moe-pld-allocations-2012-1go502ygvn562jd - site of individual pages (like docs.google.com). This one has a relevant infogram image.
|
---|
| 1157 | !! https://koreromaori.com - some actual Maori language sentences
|
---|
| 1158 | http://theunderwaterworld.com/Galleries/Roimata/roimata-frame.html - placenames
|
---|
| 1159 |
|
---|
| 1160 | UK:
|
---|
| 1161 | http://www.wordsearchfun.com/200628_Word_Find_wordsearch.html - 2 word games with Maori words (one of them has 3 different views, e.g. print view)
|
---|
| 1162 | ? https://omniatlas.com/maps/australasia/18400206/plain/ - historical map with Maori iwi names over NZ map regions
|
---|
| 1163 | ? https://omniatlas.com/maps/australasia/18400206/ - historical map of Australia and NZ at the time of the Treaty of Waitangi, with events marked in English
|
---|
| 1164 | https://centrallanguageschool.com - AUTOTRANSLATED
|
---|
| 1165 | https://www.solasolv.com - Autotranslated product site
|
---|
| 1166 | http://mikestephens.co.uk/ - photo captions containing NZ placenames
|
---|
| 1167 | http://www.woolrych.org/nzholiday2004/ - photogallery captioned with NZ placenames
|
---|
[33816] | 1168 |
|
---|
[33807] | 1169 | --------------
|
---|
[33710] | 1170 |
|
---|
[33807] | 1171 | GETTING TABLE DATA OUT OF MONGO DB:
|
---|
[33710] | 1172 |
|
---|
[33807] | 1173 | https://stackoverflow.com/questions/28733692/how-to-export-json-from-mongodb-using-robomongo
|
---|
| 1174 | "export to file" as in a spreadsheet like to a .csv?
|
---|
[33710] | 1175 |
|
---|
[33807] | 1176 | IMO this is the EASIEST way to do this in Robo 3T (formerly robomongo):
|
---|
[33710] | 1177 |
|
---|
[33807] | 1178 | 1. In the top right of the Robo 3T GUI there is a "View Results in text mode" button, click it and copy everything
|
---|
[33710] | 1179 |
|
---|
[33807] | 1180 | 2. paste everything into this website: https://json-csv.com/
|
---|
[33710] | 1181 |
|
---|
[33807] | 1182 | 3. click the download button and now you have it in a spreadsheet.
|
---|
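For a local alternative to pasting results into a website, here is a minimal sketch in plain JavaScript (Node.js). It assumes the copied results have already been turned into an array of valid JSON objects; shell-only notation such as ObjectId(...) would need converting first. The helper name toCsv and the sample documents are made up for illustration.

```javascript
// Hypothetical helper: convert an array of JSON documents to CSV text.
// Assumption: the input is valid JSON (no shell-only types like ObjectId(...)).
function toCsv(docs) {
  // Collect the union of keys across all documents, in first-seen order.
  const columns = [];
  for (const doc of docs) {
    for (const key of Object.keys(doc)) {
      if (!columns.includes(key)) columns.push(key);
    }
  }
  // Quote a cell if it contains a comma, quote or line break; double embedded quotes.
  const escapeCell = (value) => {
    if (value === undefined || value === null) return "";
    const text = typeof value === "object" ? JSON.stringify(value) : String(value);
    return /[",\n\r]/.test(text) ? '"' + text.replace(/"/g, '""') + '"' : text;
  };
  const lines = [columns.map(escapeCell).join(",")];
  for (const doc of docs) {
    lines.push(columns.map((c) => escapeCell(doc[c])).join(","));
  }
  return lines.join("\n");
}

// Made-up sample documents, loosely shaped like entries in the Websites collection.
const sampleDocs = [
  { domain: "example.nz", geoLocationCountryCode: "NZ", numPagesContainingMRI: 3 },
  { domain: "example.com", geoLocationCountryCode: "US", numPagesContainingMRI: 1, note: 'says "kia ora", twice' },
];
console.log(toCsv(sampleDocs));
```

Nested values (arrays, sub-documents) are written as JSON strings inside one cell, which keeps the sketch short but may not be what a spreadsheet user wants.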
[33710] | 1183 |
|
---|
| 1184 |
|
---|
[33807] | 1185 | https://json-csv.com/
|
---|
[33710] | 1186 |
|
---|
| 1187 |
|
---|
[33807] | 1188 | ---------------------
|
---|
[33813] | 1189 |
|
---|
| 1190 | Count of websites that have at least 1 page containing at least one sentence detected as MRI
|
---|
| 1191 | AND that have mi in the URL path:
|
---|
| 1192 |
|
---|
| 1193 | db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}]}).count()
|
---|
| 1194 |
|
---|
| 1195 | 491
|
---|
| 1196 |
|
---|
| 1197 |
|
---|
| 1198 |
|
---|
| 1199 | # The websites that have some MRI detected AND which are either in NZ or with NZ TLD
|
---|
| 1200 | # or (if they're from overseas) don't contain /mi or mi.* in the URL path:
|
---|
| 1201 |
|
---|
| 1202 | db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{$or: [{geoLocationCountryCode: "NZ"}, {domain: /\.nz$/}, {urlContainsLangCodeInPath: false}]}]}).count()
|
---|
| 1203 | 396
|
---|
| 1204 |
|
---|
| 1205 | Include Australia (to get the valid "kiwiproperty.com" website included in the result list):
|
---|
| 1206 |
|
---|
| 1207 | db.getCollection('Websites').find({$and: [
|
---|
| 1208 | {numPagesContainingMRI: {$gt: 0}},
|
---|
| 1209 | {$or: [{geoLocationCountryCode: /(NZ|AU)/}, {domain: /\.nz$/}, {urlContainsLangCodeInPath: false}]}
|
---|
| 1210 | ]}).count()
|
---|
| 1211 |
|
---|
| 1212 | 397
|
---|
| 1213 |
|
---|
| 1214 | # aggregate results by a count of country codes
|
---|
| 1215 | db.Websites.aggregate([
|
---|
| 1216 | {
|
---|
| 1217 | $match: {
|
---|
| 1218 | $and: [
|
---|
| 1219 | {numPagesContainingMRI: {$gt: 0}},
|
---|
| 1220 | {$or: [{geoLocationCountryCode: /(NZ|AU)/}, {domain: /\.nz$/}, {urlContainsLangCodeInPath: false}]}
|
---|
| 1221 | ]
|
---|
| 1222 | }
|
---|
| 1223 | },
|
---|
| 1224 | { $unwind: "$geoLocationCountryCode" },
|
---|
| 1225 | {
|
---|
| 1226 | $group: {
|
---|
| 1227 | _id: {$toLower: '$geoLocationCountryCode'},
|
---|
| 1228 | count: { $sum: 1 }
|
---|
| 1229 | }
|
---|
| 1230 | },
|
---|
| 1231 | { $sort : { count : -1} }
|
---|
| 1232 | ]);
|
---|
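To sanity-check what the $group and $sort stages above compute, here is an in-memory sketch in plain JavaScript. It is not the server's implementation, just the same counting logic applied to an array; the function name countByCountry and the sample input are hypothetical.

```javascript
// In-memory sketch of the $group / $sort stages, for hypothetical documents.
function countByCountry(sites) {
  const counts = new Map();
  for (const site of sites) {
    // Mirrors _id: {$toLower: '$geoLocationCountryCode'}; a missing code becomes "".
    const raw = site.geoLocationCountryCode;
    const key = raw === undefined || raw === null ? "" : String(raw).toLowerCase();
    // Mirrors count: {$sum: 1}
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  // Mirrors {$sort: {count: -1}}
  return [...counts.entries()]
    .map(([_id, count]) => ({ _id, count }))
    .sort((a, b) => b.count - a.count);
}

// Made-up input: two NZ sites (differing only in case) and one AU site.
console.log(countByCountry([
  { geoLocationCountryCode: "NZ" },
  { geoLocationCountryCode: "nz" },
  { geoLocationCountryCode: "AU" },
]));
```

Lowercasing the key is what merges "NZ" and "nz" into one bucket, which is presumably why the pipeline uses $toLower.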
| 1233 |
|
---|
| 1234 |
|
---|
| 1235 | # Just considering those sites outside NZ or not with .nz TLD:
|
---|
| 1236 | db.Websites.aggregate([
|
---|
| 1237 | {
|
---|
| 1238 | $match: {
|
---|
| 1239 | $and: [
|
---|
| 1240 | {geoLocationCountryCode: {$ne: "NZ"}},
|
---|
| 1241 | {domain: {$not: /\.nz/}},
|
---|
| 1242 | {numPagesContainingMRI: {$gt: 0}},
|
---|
| 1243 | {$or: [{geoLocationCountryCode: "AU"}, {urlContainsLangCodeInPath: false}]}
|
---|
| 1244 | ]
|
---|
| 1245 | }
|
---|
| 1246 | },
|
---|
| 1247 | { $unwind: "$geoLocationCountryCode" },
|
---|
| 1248 | {
|
---|
| 1249 | $group: {
|
---|
| 1250 | _id: {$toLower: '$geoLocationCountryCode'},
|
---|
| 1251 | count: { $sum: 1 },
|
---|
| 1252 | domain: { $addToSet: '$domain' }
|
---|
| 1253 | }
|
---|
| 1254 | },
|
---|
| 1255 | { $sort : { count : -1} }
|
---|
| 1256 | ]);
|
---|
| 1257 |
|
---|
| 1258 |
|
---|
[33823] | 1259 | # total count of sites excluding NZ-related sites (same filter as the aggregate above)
|
---|
| 1260 | db.getCollection('Websites').find({$and: [
|
---|
| 1261 | {geoLocationCountryCode: {$ne: "NZ"}},
|
---|
| 1262 | {domain: {$not: /\.nz/}},
|
---|
| 1263 | {numPagesContainingMRI: {$gt: 0}},
|
---|
| 1264 | {$or: [{geoLocationCountryCode: "AU"}, {urlContainsLangCodeInPath: false}]}
|
---|
| 1265 | ]}).count()
|
---|
| 1266 |
|
---|
| 1267 | 221 websites
|
---|
| 1268 |
|
---|
| 1269 |
|
---|
[33813] | 1270 | # But to produce the tentative non-product sites, we also want the aggregate for all NZ sites (from NZ or with .nz TLD):
|
---|
| 1271 | db.getCollection('Websites').find({$and: [
|
---|
| 1272 | {numPagesContainingMRI: {$gt: 0}},
|
---|
| 1273 | {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]}
|
---|
| 1274 | ]}).count()
|
---|
| 1275 |
|
---|
| 1276 | 176
|
---|
| 1277 |
|
---|
| 1278 | (Total is 221+176 = 397, which adds up).
|
---|
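Why 221 + 176 should equal 397: the overseas filter and the NZ filter are mutually exclusive, and together they cover the combined filter. A minimal sketch in plain JavaScript, mirroring the three queries as predicates over made-up documents (the function names and sample data are hypothetical). One caveat visible in the queries themselves: the combined query anchors the domain regex (/\.nz$/) while the two sub-queries use /\.nz/, so a domain with ".nz" in the middle could in principle break the equality.

```javascript
const hasMri = (s) => s.numPagesContainingMRI > 0;

// Mirrors the combined query (count 397).
const inCombined = (s) =>
  hasMri(s) &&
  (/(NZ|AU)/.test(s.geoLocationCountryCode) ||
    /\.nz$/.test(s.domain) ||
    s.urlContainsLangCodeInPath === false);

// Mirrors the overseas query (count 221).
const inOverseas = (s) =>
  s.geoLocationCountryCode !== "NZ" &&
  !/\.nz/.test(s.domain) &&
  hasMri(s) &&
  (s.geoLocationCountryCode === "AU" || s.urlContainsLangCodeInPath === false);

// Mirrors the NZ query (count 176).
const inNz = (s) =>
  hasMri(s) && (s.geoLocationCountryCode === "NZ" || /\.nz/.test(s.domain));

function summarise(sites) {
  return {
    combined: sites.filter(inCombined).length,
    overseas: sites.filter(inOverseas).length,
    nz: sites.filter(inNz).length,
    overlap: sites.filter((s) => inOverseas(s) && inNz(s)).length,
  };
}

// Made-up documents covering each branch of the filters.
const sampleSites = [
  { domain: "a.co.nz", geoLocationCountryCode: "US", numPagesContainingMRI: 2, urlContainsLangCodeInPath: true },
  { domain: "b.com", geoLocationCountryCode: "NZ", numPagesContainingMRI: 1, urlContainsLangCodeInPath: false },
  { domain: "c.com", geoLocationCountryCode: "AU", numPagesContainingMRI: 4, urlContainsLangCodeInPath: true },
  { domain: "d.com", geoLocationCountryCode: "US", numPagesContainingMRI: 1, urlContainsLangCodeInPath: false },
  { domain: "e.com", geoLocationCountryCode: "US", numPagesContainingMRI: 5, urlContainsLangCodeInPath: true },
  { domain: "f.nz", geoLocationCountryCode: "NZ", numPagesContainingMRI: 0, urlContainsLangCodeInPath: false },
];
console.log(summarise(sampleSites));
```

If overlap is 0 and overseas + nz equals combined, the split is a clean partition for that data set.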
| 1279 |
|
---|
| 1280 | # Get the count (and domain listing) output under a hardcoded _id of "nz":
|
---|
| 1281 | db.Websites.aggregate([
|
---|
| 1282 | {
|
---|
| 1283 | $match: {
|
---|
| 1284 | $and: [
|
---|
| 1285 | {numPagesContainingMRI: {$gt: 0}},
|
---|
| 1286 | {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]}
|
---|
| 1287 | ]
|
---|
| 1288 | }
|
---|
| 1289 | },
|
---|
| 1290 | { $unwind: "$geoLocationCountryCode" },
|
---|
| 1291 | {
|
---|
| 1292 | $group: {
|
---|
| 1293 | _id: "nz",
|
---|
| 1294 | count: { $sum: 1 },
|
---|
| 1295 | domain: { $addToSet: '$domain' }
|
---|
| 1296 | }
|
---|
| 1297 | },
|
---|
| 1298 | { $sort : { count : -1} }
|
---|
| 1299 | ]);
|
---|
[33816] | 1300 |
|
---|
| 1301 |
|
---|
| 1302 | -----------------------
|
---|
[33823] | 1303 | US:
|
---|
[33816] | 1304 | Done: manually inspected 68/117 sites
|
---|
| 1305 |
|
---|
[33823] | 1306 | TOTAL US: 4+7+7+4+3=25
|
---|
| 1307 |
|
---|
[33816] | 1308 | DEFINITELY:
|
---|
| 1309 | + http://anglicanhistory.org,
|
---|
| 1310 | + http://www.unicode.org, [Universal declaration of Human Rights]
|
---|
| 1311 | + https://static-promote.weebly.com,
|
---|
| 1312 | + http://aclhokiangarocks.blogspot.com, [often English, but COMMUNITY]
|
---|
| 1313 |
|
---|
| 1314 | BIBLE/MOHAMMED/BAHAI TRANSLATIONS probably not auto translations:
|
---|
| 1315 | + http://bahaiprayers.net, [Dutch seems to be properly translated, not auto-translated, so maybe MRI too]
|
---|
| 1316 | + https://biblehub.com,
|
---|
| 1317 | + http://www.muhammad.com, [possibly not autotranslated]
|
---|
| 1318 | + http://www.godrules.net, [possibly not autotranslated]
|
---|
| 1319 | + http://m.biblepub.com,
|
---|
| 1320 | + http://www.krassotkin.ru, [probably real translations, as there are multiple Dutch translations from different sources provided]
|
---|
| 1321 | + http://www.gotquestions.org, [doesn't appear autotranslated]
|
---|
| 1322 | X https://ebible.org, [Hiri Motu, PNG language misdetected. Doesn't seem to have Maori]
|
---|
| 1323 | X https://www.bible.com, doesn't have Maori translation. Misdetected.
|
---|
| 1324 | X https://wol.jw.org, - doesn't have Maori translations. Instead, Rongo-rongo, Kiribati (Micronesian) etc misdetected
|
---|
| 1325 | X https://png.bible, [misdetected, Papua New Guinea]
|
---|
| 1326 | X http://www.precious-testimonies.com, http://precious-testimonies.com/JesusDidItTranslations/JesusDidItMaoriTranslation.htm may be autotranslated as the Dutch page looks more like Danish or some Scandinavian language and the French page is missing accented characters.
|
---|
| 1327 |
|
---|
[33823] | 1328 | CHECK - PROBABLY:
|
---|
[33816] | 1329 | !! https://maorinews.com,
|
---|
| 1330 | !! http://maaori.com,
|
---|
| 1331 | !!+ http://kiaorahola.blogspot.com/
|
---|
| 1332 | + https://kjohnsonnz.blogspot.com,
|
---|
| 1333 | + http://pumanawawhangara.blogspot.com,
|
---|
| 1334 | + http://dannykahei.tripod.com,
|
---|
| 1335 | + http://burkekm001.tripod.com
|
---|
| 1336 | + http://tkkpipipaopao.blogspot.com,
|
---|
| 1337 | + http://manateina.blogspot.com,
|
---|
| 1338 | ? tkkpipipaopao.blogspot.com? http://rangiwewehi.com, [English, but community]
|
---|
| 1339 | ? https://www.terakau.org, [COMMUNITY, but English]
|
---|
| 1340 | ? https://www.pipirikiapapatuanuku.org, [COMMUNITY?, in English, environment site]
|
---|
| 1341 | ~ http://georgegi.tripod.com,
|
---|
| 1342 | ~ http://ngarangatahi.tripod.com, [1 page, image caption, Maori language warden position title with English sentence for appointment as warden]
|
---|
| 1343 | X http://fhr.kiwicelts.com,
|
---|
| 1344 | X http://tkrow.tripod.com, [English, background of NZ place]
|
---|
| 1345 | X http://www.mkiwi.com, - placenames
|
---|
| 1346 | X http://www.waimate.com, [English, NZ place]
|
---|
| 1347 |
|
---|
| 1348 | MAYBE, INSPECT:
|
---|
| 1349 | ? https://www.natekore2018.com, [lots of English, but COMMUNITY, CULTURE]
|
---|
| 1350 | + http://tatai09.blogspot.com,
|
---|
| 1351 | + http://www.twttoa.com,
|
---|
| 1352 | + http://tuhua2010.blogspot.com,
|
---|
| 1353 | X http://www.huapala.org, [misdetected, Hawaiian]
|
---|
| 1354 | X https://www.vaihaunui.net, [misdetected, Tahiti]
|
---|
| 1355 | X https://www.kaifineart.com, [art site by different artists. A Chinese and another (possibly Japanese) name were misdetected]
|
---|
| 1356 | X http://mahoraroom8.blogspot.com, [NZ school, but main page mostly in English. No pages with > 1 sentence detected as MRI]
|
---|
| 1357 | + http://piripi.blogspot.com,
|
---|
| 1358 | X http://www.hiroa.pf, [misdetected. Crawled content appears Polynesian not Maori]
|
---|
| 1359 | X http://korora.econ.yale.edu, [NZ place photo caption]
|
---|
| 1360 | X https://www.poehalisnami.ua, [mostly Cyrillic, with some NZ or Polynesian names misdetected]
|
---|
| 1361 | X http://hannas-reiseblog.blogspot.com - one page contained NZ placenames, another had a word misdetected
|
---|
| 1362 |
|
---|
| 1363 |
|
---|
| 1364 | + https://www.breaker.audio, [audio, with occasional English.]
|
---|
| 1365 | ? https://livestream.com, [video and audio, seems in English, but maybe CULTURAL/COMMUNITY?]
|
---|
| 1366 |
|
---|
| 1367 | X https://docs.google.com, timetable with occasional Maori language word
|
---|
| 1368 | + https://drive.google.com, https://drive.google.com/file/d/1NwuzafjddaP8gxI7O_Zapts5bM7mrtwn/preview is an image of Maori number names. But the other page on drive.google.com is an NZ certificate or ID (in English) of a person's position.
|
---|
| 1369 | http://ritusehji.blogspot.com - no page with more than 1 sentence detected. But short string of actual MRI content. Educator blog with pictures and English language content.
|
---|
| 1370 |
|
---|
| 1371 |
|
---|
| 1372 | PINTEREST
|
---|
| 1373 | + https://in.pinterest.com/pin/317363104978423418/
|
---|
| 1374 | "karakia mo te moana - Google Search | Te Reo Maori Resources | Moana, Powerpoint tips, Google"
|
---|
| 1375 | ? https://za.pinterest.com/pin/524669425310419500/
|
---|
| 1376 | Maori Moko | Image | Moko Maori Tattoo & Portraits | TA MOKO | Maori tribe, Maori people, Maori art [COMMUNITY, CULTURE]
|
---|
| 1377 | [The other pinterest detected as numPagesContainingMRI > 0 was misdetected]
|
---|
| 1378 |
|
---|
| 1379 | https://nl.pinterest.com,
|
---|
| 1380 | https://www.pinterest.jp,
|
---|
| 1381 | https://www.pinterest.it,
|
---|
| 1382 | https://www.pinterest.co.uk,
|
---|
| 1383 | https://www.pinterest.ca,
|
---|
| 1384 | https://za.pinterest.com,
|
---|
| 1385 | https://www.pinterest.fr,
|
---|
| 1386 | https://in.pinterest.com,
|
---|
| 1387 |
|
---|
| 1388 | MORE BLOGSPOTS
|
---|
| 1389 | X http://word-dialect.blogspot.com, [Indonesian, misdetected]
|
---|
| 1390 | ~ http://atopeconlostopes.blogspot.com, [title on page appears to be in MRI, but content appears to be in English and South/Central American. Internationally focussed content.]
|
---|
| 1391 | X http://lianzaconference2012.blogspot.com, [NZ placename or institution]
|
---|
| 1392 | ? http://mrshamiltonskoolkidz.blogspot.com, [te reo Maori related school activities. Described in English.]
|
---|
| 1393 | X http://capsuraotearoa.blogspot.com, [blog in French, photo captions contain NZ placenames]
|
---|
| 1394 | X http://blogdepasopor.blogspot.com, [blog in French, Rapa Nui/Easter Island related content, misdetected.]
|
---|
| 1395 |
|
---|
| 1396 |
|
---|
| 1397 | UNLIKELY
|
---|
| 1398 | ?? http://naturalfatburner.net, http://naturalfatburner.net/NoNonsenseTed/fatloss-mao/ feels like it's autotranslated, an image of text appears, but the text is in MRI [advertising for some weight loss gimmick]
|
---|
| 1399 |
|
---|
| 1400 |
|
---|
| 1401 | BLACKLIST:
|
---|
| 1402 | X http://ww25.milfsplease.com,
|
---|
| 1403 | X http://www.the-naked.com
|
---|
| 1404 |
|
---|
| 1405 | OTHER:
|
---|
| 1406 | X http://seapixonline.com, https://www.seapixonline.com, [photo captions of ships. Sometimes misdetected Japanese words as MRI.]
|
---|
| 1407 | X http://www.code-postal.com, https://www.code-postal.com, [not more than 1 sentence detected as in MRI]
|
---|
| 1408 | X https://www.dbnames.net, [Name database, lots misdetected]
|
---|
| 1409 |
|
---|
| 1410 | STILL TO DO LIST:
|
---|
| 1411 |
|
---|
| 1412 | X https://www.myadsclassified.com, [misdetected 3 short English sentences as MRI]
|
---|
| 1413 | X http://www.whoisthatr.com, [misdetected short English sentence as MRI]
|
---|
| 1414 | X https://www.oemsec.com, [autotranslated product site]
|
---|
| 1415 | X http://svenskadress.net, [linkfarm like site of related junk links, contained URLs misdetected as MRI]
|
---|
| 1416 |
|
---|
| 1417 | X https://www.webwiki.com, [contains URLs. URLs containing Aotearoa as substring detected as MRI. But no proper sentence content. ]
|
---|
| 1418 | X http://mikebonnice.com, [Hawaiian and Tahiti related content misdetected]
|
---|
| 1419 | X http://www.hudl.com, [misdetected short English sentence as MRI]
|
---|
| 1420 | X http://www.wikitree.com, [misdetected short English sentence as MRI]
|
---|
| 1421 | X http://shuttersportnelson.photoshelter.com, [image captions of "Wairua Warrior"]
|
---|
| 1422 |
|
---|
| 1423 | X http://niken8media.logdown.com, [Poker website? Looks autotranslated or Lorem Ipsum type of meaningless sentences.]
|
---|
| 1424 | X https://www.podrozeady.com, Looks Polish or other East-European language. The NZ page https://www.podrozeady.com/NZ/4/ had placenames detected.
|
---|
| 1425 |
|
---|
| 1426 | X http://www.thesalmons.org, [detection and misdetection of author names of papers hosted]
|
---|
| 1427 |
|
---|
| 1428 | X http://linkvip.top, [.rar and media file links misdetected as MRI]
|
---|
| 1429 |
|
---|
| 1430 |
|
---|
| 1431 | X http://www.lunar-occultations.com, [NZ place names for astronomical phenomena]
|
---|
| 1432 | X http://shangrilapress.net, [NZ placenames]
|
---|
| 1433 | X http://malecek.com, [misdetection of CD title]
|
---|
| 1434 | X https://www.blue-frontiers.com, [Tahitian, Reo Tahiti, misdetected as MRI]
|
---|
| 1435 | X http://www.whoisentry.com, [URL names, looked at several which were probably misdetected as MRI]
|
---|
| 1436 | X http://loquevendra318.com, [uses Google translate for auto-translation]
|
---|
| 1437 |
|
---|
| 1438 |
|
---|
| 1439 | ?? http://www.forensicfashion.com, [historical information, useful for CULTURE? e.g. http://www.forensicfashion.com/1807MaoriChief.html]
|
---|
| 1440 |
|
---|
| 1441 | X http://www.eyecontactsite.com, [Lots of names. And a few short sentences or words possibly in comments.]
|
---|
| 1442 | X http://eartheum.com, [Rapa Nui, Easter Island related content. Misdetected]
|
---|
| 1443 | X http://www.steve-wheeler.co.uk, [Blogspot. Title of a single page is in Maori. "Aotearoa ... kei te aroha au ki a koe"]
|
---|
| 1444 | X https://chromium.googlesource.com, [some source code related to languages' two letter codes]
|
---|
| 1445 |
|
---|
| 1446 | X http://www.roadsmile.com, [Lots of misdetection based on word Kia.]
|
---|
| 1447 | ?? https://www.knowatom.com, https://phet.colorado.edu [Similar looking science web sites for children. Uses auto-translation?]
|
---|
| 1448 |
|
---|
| 1449 | X https://www.indexmundi.com, [place names. Pages about Solomon Islands. Misdetection of placenames.]
|
---|
| 1450 |
|
---|
| 1451 |
|
---|
| 1452 |
|
---|
| 1453 | X http://wowwars.net, [Has a page on Kia Kaha meaning, but URL redirects to a different low quality site with bad formatting and adverts. ]
|
---|
| 1454 | ?? https://www.hidroponia.org.mx, [Not sure if https://www.hidroponia.org.mx/index.php/idiomas/284-hydroponics-te-ahurea-wai-maori is autotranslated or not. Can't easily locate existence of Dutch or German translated pages. There's Tamil-Singapore, but no other Tamil. So maybe translations based on target buyer audience?]
|
---|
| 1455 | X http://www.v3whois.com, [URLs are misdetected as MRI]
|
---|
| 1456 | X http://rhymebrain.com, [appears to have misdetected a short phrase of 2 words, Kai Kaia, besides phrase words from other languages]
|
---|
| 1457 |
|
---|
| 1458 |
|
---|
| 1459 | X SINGLE SENTENCE DETECTED (NO MORE, AND NOT A PAGE):
|
---|
| 1460 | http://frontrowphotos.com,
|
---|
| 1461 | http://www.pressreader.com,
|
---|
| 1462 | https://www.nccri.ie,
|
---|
| 1463 | http://takethatvacation.com,
|
---|
| 1464 | http://worldradiomap.com,
|
---|
| 1465 | http://www.namesdir.com,
|
---|
| 1466 |
|
---|
| 1467 | X http://www.frogsonline.com, [NZ hotels, placenames]
|
---|
| 1468 | X http://www.geni.com, [Single sentence misdetection]
|
---|
| 1469 | X http://wikiedit.org, [just a list of lots of words, possibly placenames. Some misdetected, e.g. Rapa Nui]
|
---|
| 1470 |
|
---|
| 1471 |
|
---|
[33823] | 1472 |
|
---|
| 1473 | ---------------
|
---|
| 1474 |
|
---|
| 1475 | MANUAL - TOTAL NUM SITES WITH SOME MRI CONTENT BY COUNTRY
|
---|
| 1476 | NZ: 176
|
---|
| 1477 | US: 25
|
---|
| 1478 | AU: 3
|
---|
| 1479 | FR: 1
|
---|
| 1480 | DK: 2
|
---|
| 1481 | (CA: 0.5)
|
---|
| 1482 | DE: 2
|
---|
| 1483 | IE (Ireland): 1
|
---|
| 1484 | CZ: 1
|
---|
| 1485 | ES: 1
|
---|
| 1486 | BG: 1
|
---|
| 1487 |
|
---|
| 1488 | TIDIED:
|
---|
| 1489 | NZ: 176
|
---|
| 1490 | US: 25
|
---|
| 1491 | AU: 3
|
---|
| 1492 | DE: 2
|
---|
| 1493 | DK: 2
|
---|
| 1494 | BG: 1
|
---|
| 1495 | CZ: 1
|
---|
| 1496 | ES: 1
|
---|
| 1497 | FR: 1
|
---|
| 1498 | IE: 1
|
---|
| 1499 | TOTAL: 213
|
---|
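The TIDIED ordering and the TOTAL can be re-derived from the MANUAL counts with a short script. A sketch in plain JavaScript, assuming the tentative (CA: 0.5) entry is left out as it is in the TIDIED list; the function name tidy is made up.

```javascript
// Counts copied from the MANUAL list above; the tentative (CA: 0.5) entry is omitted.
const manualCounts = { NZ: 176, US: 25, AU: 3, FR: 1, DK: 2, DE: 2, IE: 1, CZ: 1, ES: 1, BG: 1 };

// Sort by count (descending), then by country code (ascending), and sum the counts.
function tidy(counts) {
  const rows = Object.entries(counts).sort(([codeA, a], [codeB, b]) => {
    if (b !== a) return b - a;
    return codeA < codeB ? -1 : codeA > codeB ? 1 : 0;
  });
  const total = rows.reduce((sum, [, n]) => sum + n, 0);
  return { rows, total };
}

const tidied = tidy(manualCounts);
for (const [code, n] of tidied.rows) console.log(code + ": " + n);
console.log("TOTAL: " + tidied.total);
```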
| 1500 |
|
---|
| 1501 |
|
---|