MongoDB Installation:
https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/
https://docs.mongodb.com/manual/administration/install-on-linux/
https://hevodata.com/blog/install-mongodb-on-ubuntu/
https://www.digitalocean.com/community/tutorials/how-to-install-mongodb-on-ubuntu-16-04

CENTOS (Analytics):
https://tecadmin.net/install-mongodb-on-centos/

FROM SOURCE:
https://github.com/mongodb/mongo/wiki/Build-Mongodb-From-Source

GUI:
https://robomongo.org/
Robomongo is Robo 3T now

https://www.tutorialspoint.com/mongodb/mongodb_java.htm

JAR FILE:
http://central.maven.org/maven2/org/mongodb/mongo-java-driver/
https://mongodb.github.io/mongo-java-driver/

https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/
http://www.programmersought.com/article/6500308940/

   52  sudo apt-get install mongodb-clients
   53  mongo 'mongodb://mongodb.cms.waikato.ac.nz:27017' -u anupama -p

Failed with
   Error: HostAndPort: host is empty at src/mongo/shell/mongo.js:148
   exception: connect failed

This is due to a version incompatibility between Client and mongodb Server.
The solution is to follow instructions at http://www.programmersought.com/article/6500308940/
and then https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/ as below:

   54  sudo apt-get purge mongodb-clients
   55  sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 9DA31620334BD75D9DCB49F368818C72E52529D4
   56  echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu xenial/mongodb-org/4.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.0.list
   57  sudo apt-get update
   58  sudo apt-get install mongodb-clients
   59  mongo 'mongodb://mongodb.cms.waikato.ac.nz:27017' -u anupama -p
       (still doesn't work)
   60  sudo apt-get install -y mongodb-org

The above ensures an up-to-date mongo client but installs the mongodb server too.
Maybe this is the only step that is needed to install up-to-date mongo client and mongodb server?
   72  sudo service mongod status
  103  sudo service mongod start
       "mongod" stands for mongo-daemon. This runs the mongo db server listening for client connections
  104  sudo service mongod status
   88  sudo service mongod stop

DETAILS:
wharariki:[879]/Scratch/ak19/gs3-extensions/maori-lang-detection>mongo 'mongodb://mongodb.cms.waikato.ac.nz:27017' -u anupama -p

didn't work with the pwd. Failed with:

MongoDB shell version: 2.6.10
Enter password:
connecting to: mongodb://mongodb.cms.waikato.ac.nz:27017
2019-11-04T20:02:47.970+1300 Assertion: 13110:HostAndPort: host is empty
2019-11-04T20:02:47.970+1300 0x6b75c9 0x659e9f 0x636f69 0x4fa55c 0x501249 0x4fa7f1 0x6006fd 0x5eb869 0x7f7bfbd47d76 0x1f3c10d06362
 mongo(_ZN5mongo15printStackTraceERSo+0x39) [0x6b75c9]
 mongo(_ZN5mongo10logContextEPKc+0x21f) [0x659e9f]
 mongo(_ZN5mongo11msgassertedEiPKc+0xd9) [0x636f69]
 mongo(_ZN5mongo16ConnectionString12_fillServersENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x50c) [0x4fa55c]
 mongo(_ZN5mongo16ConnectionStringC1ENS0_14ConnectionTypeERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES9_+0x99) [0x501249]
 mongo(_ZN5mongo16ConnectionString5parseERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERS6_+0x201) [0x4fa7f1]
 mongo(_ZN5mongo17mongoConsExternalEPNS_7V8ScopeERKN2v89ArgumentsE+0x11d) [0x6006fd]
 mongo(_ZN5mongo7V8Scope10v8CallbackERKN2v89ArgumentsE+0xa9) [0x5eb869]
 /usr/lib/libv8.so.3.14.5(+0x99d76) [0x7f7bfbd47d76]
 [0x1f3c10d06362]
2019-11-04T20:02:47.971+1300 Error: HostAndPort: host is empty at src/mongo/shell/mongo.js:148
exception: connect failed

This is due to a version incompatibility between Client and mongodb Server.
Can find client version above. (2.6.10)
Server version can be found by running the mongo client shell.
Doing so without loading a db: wharariki:[880]/Scratch/ak19/gs3-extensions/maori-lang-detection>mongo --shell -nodb MongoDB shell version: 2.6.10 <<<<<<<<<-------------------<<<< MONGO CLIENT VERSION type "help" for help > help db.help() help on db methods db.mycoll.help() help on collection methods sh.help() sharding helpers rs.help() replica set helpers help admin administrative help help connect connecting to a db help help keys key shortcuts help misc misc things to know help mr mapreduce show dbs show database names show collections show collections in current database show users show users in current database show profile show most recent system.profile entries with time >= 1ms show logs show the accessible logger names show log [name] prints out the last segment of log in memory, 'global' is default use set current database db.foo.find() list objects in collection foo db.foo.find( { a : 1 } ) list objects in foo where a == 1 it result of the last line evaluated; use to further iterate DBQuery.shellBatchSize = x set default number of items to display on shell exit quit the mongo shell > help connect Normally one specifies the server on the mongo shell command line. Run mongo --help to see those options. Additional connections may be opened: var x = new Mongo('host[:port]'); var mydb = x.getDB('mydb'); or var mydb = connect('host[:port]/mydb'); Note: the REPL prompt only auto-reports getLastError() for the shell command line connection. Getting help on connect options: > var x = new Mongo('mongodb.cms.waikato.ac.nz:27017'); > var mydb = x.getDB('anupama'); > mydb.connect.help() DBCollection help db.connect.find().help() - show DBCursor help db.connect.count() db.connect.copyTo(newColl) - duplicates collection by copying all documents to newColl; no indexes are copied. db.connect.convertToCapped(maxBytes) - calls {convertToCapped:'connect', size:maxBytes}} command db.connect.dataSize() db.connect.distinct( key ) - e.g. 
db.connect.distinct( 'x' ) db.connect.drop() drop the collection db.connect.dropIndex(index) - e.g. db.connect.dropIndex( "indexName" ) or db.connect.dropIndex( { "indexKey" : 1 } ) db.connect.dropIndexes() db.connect.ensureIndex(keypattern[,options]) - options is an object with these possible fields: name, unique, dropDups db.connect.reIndex() db.connect.find([query],[fields]) - query is an optional query filter. fields is optional set of fields to return. e.g. db.connect.find( {x:77} , {name:1, x:1} ) db.connect.find(...).count() db.connect.find(...).limit(n) db.connect.find(...).skip(n) db.connect.find(...).sort(...) db.connect.findOne([query]) db.connect.findAndModify( { update : ... , remove : bool [, query: {}, sort: {}, 'new': false] } ) db.connect.getDB() get DB object associated with collection db.connect.getPlanCache() get query plan cache associated with collection db.connect.getIndexes() db.connect.group( { key : ..., initial: ..., reduce : ...[, cond: ...] } ) db.connect.insert(obj) db.connect.mapReduce( mapFunction , reduceFunction , ) db.connect.aggregate( [pipeline], ) - performs an aggregation on a collection; returns a cursor db.connect.remove(query) db.connect.renameCollection( newName , ) renames the collection. 
db.connect.runCommand( name , ) runs a db command with the given name where the first param is the collection name db.connect.save(obj) db.connect.stats() db.connect.storageSize() - includes free space allocated to this collection db.connect.totalIndexSize() - size in bytes of all the indexes db.connect.totalSize() - storage allocated for all data and indexes db.connect.update(query, object[, upsert_bool, multi_bool]) - instead of two flags, you can pass an object with fields: upsert, multi db.connect.validate( ) - SLOW db.connect.getShardVersion() - only for use with sharding db.connect.getShardDistribution() - prints statistics about data distribution in the cluster db.connect.getSplitKeysForChunks( ) - calculates split points over all chunks and returns splitter function db.connect.getWriteConcern() - returns the write concern used for any operations on this collection, inherited from server/db if set db.connect.setWriteConcern( ) - sets the write concern for writes to the collection db.connect.unsetWriteConcern( ) - unsets the write concern for writes to the collection > mydb.version() 4.0.13 <<<<<<<<<-------------------<<<< MONGODB SERVER VERSION (Check Mongo server version: https://stackoverflow.com/questions/38160412/how-to-find-the-exact-version-of-installed-mongodb) Finally we now know the mongodb server version: 4.0.13 This version doesn't work with our mongo client (shell) version of 2.6.10. 
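The mismatch above can be spotted by comparing the two version strings. The following is a minimal sketch in plain JavaScript (it should also paste into the mongo shell). compareVersions and majorVersionDiffers are hypothetical helpers, not MongoDB functions, and treating a different major version as a sign of trouble is an assumption made for these notes, not MongoDB's official compatibility rule.

```javascript
// Hypothetical helper (not a MongoDB API): compares dotted version strings.
// Returns a negative number if a is older than b, 0 if equal, positive if newer.
function compareVersions(a, b) {
  var pa = a.split(".").map(Number);
  var pb = b.split(".").map(Number);
  var n = Math.max(pa.length, pb.length);
  for (var i = 0; i < n; i++) {
    var x = pa[i] || 0; // missing parts count as 0, so "4.0" equals "4.0.0"
    var y = pb[i] || 0;
    if (x !== y) return x - y;
  }
  return 0;
}

// Assumption: a client whose major version differs from the server's
// is the first thing to upgrade when connections fail.
function majorVersionDiffers(client, server) {
  return client.split(".")[0] !== server.split(".")[0];
}

var clientVersion = "2.6.10"; // from the shell banner above
var serverVersion = "4.0.13"; // from mydb.version() above
var clientIsOlder = compareVersions(clientVersion, serverVersion) < 0;
var needsUpgrade = majorVersionDiffers(clientVersion, serverVersion);
```

With the two versions recorded above, both clientIsOlder and needsUpgrade come out true, which is consistent with the decision to replace the 2.6.10 client.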
DETAILS OF INSTALLING MONGO-CLIENT AND UPDATING IT, AND INSTALLING MONGODB SERVER:

   54  sudo apt-get purge mongodb-clients
   55  sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 9DA31620334BD75D9DCB49F368818C72E52529D4
   56  echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu xenial/mongodb-org/4.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.0.list
   57  sudo apt-get update
   58  sudo apt-get install mongodb-clients
   59  mongo 'mongodb://mongodb.cms.waikato.ac.nz:27017' -u anupama -p
   60  sudo apt-get install -y mongodb-org
   61  mongo 'mongodb://mongodb.cms.waikato.ac.nz:27017' -u anupama -p
   62  sudo service apache2 status
   63  sudo service sshd status
   64  sudo service mongodb status
   65  sudo service mongo status
   66  mongod
   67  mongod --help
   68  mongod --help | less
   69  mongod -f /etc/mongod.conf
   70  sudo mongod -f /etc/mongod.conf
   71  less /etc/mongod.conf
   72  sudo service mongod status
   73  sudo service mongod start
   74  sudo service mongod status
   75  ls -l /var/log/mongodb/mongod.log
   76  sudo rm /var/log/mongodb/mongod.log
   77  sudo service mongod status
   78  sudo service mongod start
   79  sudo service mongod status
   80  sudo service mongod stop
   81  ps auxww | grep mongo
   82  sudo service mongod start
   83  sudo service mongod status
   84  ps auxww | grep mongo
   85  sudo dmsg
   86  sudo dmesg
   87  sudo service mongod status
   88  sudo service mongod stop
   89  sudo service mongod start
   90  sudo dmesg
   91  sudo less /var/log/mongodb/mongod.log
   92  ls /var/lib/
   93  ls -ld /var/lib/
   94  ls -l /var/log/mongodb/mongod.log
   95  ls -ld /var/lib/
   96  groups mongodb
   97  less /etc/mongod.conf
   98  sudo less /var/log/mongodb/mongod.log
   99  less /etc/mongod.conf
  100  ls -l /var/lib/mongodb/
  101  sudo chown -R mongodb /var/lib/mongodb/
  102  sudo chgrp -R mongodb /var/lib/mongodb/
  103  sudo service mongod start
  104  sudo service mongod status
  105  history

MONGO DB ROBO 3T
1. Download "Double Pack" from https://robomongo.org/
2. Untar its contents. Then untar the tarball in that.
3. Run:
   wharariki:[110]~/Downloads/robo3t-1.3.1-linux-x86_64-7419c406>./bin/robo3t

===================

On analytics, vagrant node1, we've installed the mongodb server and client.
We're able to successfully create collections on here.

vagrant@node1:~$ mongo
MongoDB shell version v4.2.1
connecting to: mongodb://127.0.0.1:27017/?compressors=disabled&gssapiServiceName=mongodb
Implicit session: session { "id" : UUID("87bb585c-4685-47f6-bf89-a93801daeb2d") }
MongoDB server version: 4.2.1
Server has startup warnings:
2019-11-04T07:48:14.197+0000 I STORAGE [initandlisten]
2019-11-04T07:48:14.198+0000 I STORAGE [initandlisten] ** WARNING: Using the XFS filesystem is strongly recommended with the WiredTiger storage engine
2019-11-04T07:48:14.198+0000 I STORAGE [initandlisten] ** See http://dochub.mongodb.org/core/prodnotes-filesystem
2019-11-04T07:48:14.624+0000 I CONTROL [initandlisten]
2019-11-04T07:48:14.624+0000 I CONTROL [initandlisten] ** WARNING: Access control is not enabled for the database.
2019-11-04T07:48:14.624+0000 I CONTROL [initandlisten] ** Read and write access to data and configuration is unrestricted.
2019-11-04T07:48:14.624+0000 I CONTROL [initandlisten]
---
Enable MongoDB's free cloud-based monitoring service, which will then receive and display
metrics about your deployment (disk utilization, CPU, operation statistics, etc).

The monitoring data will be available on a MongoDB website with a unique URL accessible to you
and anyone you share the URL with. MongoDB may use this information to make product
improvements and to suggest MongoDB products and deployment options to you.
To enable free monitoring, run the following command: db.enableFreeMonitoring()
To permanently disable this reminder, run the following command: db.disableFreeMonitoring()
---

> show dbs
admin   0.000GB
config  0.000GB
local   0.000GB
> use db ateacrawldata
2019-11-05T05:24:20.155+0000 E QUERY [js] Error: [db ateacrawldata] is not a valid database name :
Mongo.prototype.getDB@src/mongo/shell/mongo.js:51:12
getDatabase@src/mongo/shell/session.js:913:28
DB.prototype.getSiblingDB@src/mongo/shell/db.js:22:12
shellHelper.use@src/mongo/shell/utils.js:803:10
shellHelper@src/mongo/shell/utils.js:790:15
@(shellhelp2):1:1
> db.createCollection('webpages');
{ "ok" : 1 }
> db.webpages.drop();
...
^C
> db.webpages.drop();
true
> use ateacrawldata
switched to db ateacrawldata
> db.createCollection('webpages');
{ "ok" : 1 }
> show collections
webpages
> db.createCollection('websites');
{ "ok" : 1 }
>

------------------------
Ask Clint to rename "anupama" database to "ateacrawldata" database following the instructions at:
https://stackoverflow.com/questions/9201832/how-do-you-rename-a-mongodb-database

I don't have permissions to do this. Nor do I have permissions to create Mongo collections
within a new database that I create, like ateacrawldata. I only seem to have rights to the
"anupama" database.

-----------------------
Vagrant virtual machine Node1 has the mongodb installed.
After doing "vagrant up" on node1 to start node1:

[anupama@analytics vagrant-hadoop-hive-spark]$ vagrant ssh
vagrant@node1:~$ mongo
MongoDB shell version v4.2.1
connecting to: mongodb://127.0.0.1:27017/?compressors=disabled&gssapiServiceName=mongodb
2019-11-13T09:22:46.996+0000 E QUERY [js] Error: couldn't connect to server 127.0.0.1:27017, connection attempt failed: SocketException: Error connecting to 127.0.0.1:27017 :: caused by :: Connection refused :
connect@src/mongo/shell/mongo.js:341:17
@(connect):2:6
2019-11-13T09:22:46.999+0000 F - [main] exception: connect failed
2019-11-13T09:22:46.999+0000 E - [main] exiting with code 1
vagrant@node1:~$ sudo service mongod status
● mongod.service - MongoDB Database Server
   Loaded: loaded (/lib/systemd/system/mongod.service; disabled; vendor preset: enabled)
   Active: inactive (dead)
     Docs: https://docs.mongodb.org/manual
vagrant@node1:~$ sudo service mongod start
vagrant@node1:~$ sudo service mongod status
● mongod.service - MongoDB Database Server
   Loaded: loaded (/lib/systemd/system/mongod.service; disabled; vendor preset: enabled)
   Active: active (running) since Wed 2019-11-13 09:24:07 UTC; 2s ago
     Docs: https://docs.mongodb.org/manual
 Main PID: 4383 (mongod)
    Tasks: 32
   Memory: 199.3M
      CPU: 754ms
   CGroup: /system.slice/mongod.service
           └─4383 /usr/bin/mongod --config /etc/mongod.conf

Nov 13 09:24:07 node1 systemd[1]: Started MongoDB Database Server.
vagrant@node1:~$

So now mongodb is running on node1 on localhost:27017.

Next, in another x-term connected to analytics' node1 Vagrant VM, port forward node1's localhost:27017 to analytics' localhost:27017:
   vagrant ssh -- -L 27017:localhost:27017

Finally, in another x-term, port-forward from analytics:27017 to current machine's 27017:
   ssh -L 27017:localhost:27017 analytics

Now can connect Robo-3T running on current machine to localhost:27017.
Then in a new x-term, can use the client mongo shell to connect (by default to localhost:27017):

wharariki:[122]/Scratch/ak19/GS309>mongo --shell
MongoDB shell version v4.0.13
connecting to: mongodb://127.0.0.1:27017/?gssapiServiceName=mongodb
...
> show dbs
admin          0.000GB
ateacrawldata  1.532GB
config         0.000GB
local          0.000GB
> use ateacrawldata
> show collections
Webpages
Websites
oldwebpages
oldwebsites

-------------------
Country code to geolocation CSV file found by Dr Bainbridge:
https://developers.google.com/public-data/docs/canonical/countries_csv

Import into mongodb with:
https://stackoverflow.com/questions/4686500/how-to-use-mongoimport-to-import-csv

NOTE: mongoimport is a commandline utility and not a command to be run from the mongo shell.
See https://jira.mongodb.org/browse/DOCS-11072

This means, in an x-term, DON'T RUN MONGO SHELL/client first.
Instead, directly from x-term, run the following to import the countrycodes.csv file:

   mongoimport -d ateacrawldata -c countrylocations --type csv --file /Scratch/ak19/maori-lang-detection/MoreReading/countrycodes.csv --headerline

-------------------------
MONGODB QUERIES:

db.getCollection('webpages').find({"isMRI": true, "singleSentences.langCode": "mri"})

db.getCollection('webpages').find({"singleSentences": { $elemMatch: {"langCode":"mri"} } }, {"singleSentences.$": "mri"})

db.getCollection('Webpages').find({"isMRI": true, "singleSentences": { $elemMatch: {"langCode":"eng"} } }, {"singleSentences.$": "eng"})
   [single English lang sentence]

db.getCollection('Webpages').find({"containsMRI": true, "singleSentences": { $elemMatch: {"langCode":"mri"} } }, {"singleSentences.$": "mri"})
   [gets 1st sentence of docs which have sentences containing MRI]

READING

mongodb java convert class
https://www.quora.com/What-are-the-ways-of-converting-a-Java-object-to-a-MongoDB-document-and-vice-versa
https://stackoverflow.com/questions/39320825/pojo-to-org-bson-document-and-vice-versa
X https://mongodb.github.io/morphia/
https://stackoverflow.com/questions/10170506/inserting-java-object-to-mongodb-collection-using-java X https://www.google.com/search?q=morphia+example&oq=morphia+example&aqs=chrome.0.0l6.4223j0j9&sourceid=chrome&ie=UTF-8 https://www.baeldung.com/mongodb-morphia X https://web.archive.org/web/20171117121335/http://mongodb.github.io/morphia/1.3/getting-started/ => https://morphia.dev/1.4/getting-started/quick-tour/ https://github.com/MorphiaOrg/morphia/tree/master/docs/reference mongodb querying https://docs.mongodb.com/manual/tutorial/query-embedded-documents/ https://docs.mongodb.com/manual/tutorial/query-arrays/ https://www.google.com/search?q=mongodb+find+subdocument&oq=mongodb+find+&aqs=chrome.0.69i59j69i57j0l4.7607j1j8&sourceid=chrome&ie=UTF-8 https://stackoverflow.com/questions/25586901/how-to-find-document-and-single-subdocument-matching-given-criterias-in-mongodb https://stackoverflow.com/questions/21113543/mongodb-get-subdocument https://stackoverflow.com/questions/36948856/find-subdocuments-in-mongo https://docs.mongodb.com/v3.0/reference/operator/projection/positional/#proj._S_ https://www.google.com/search?q=mongodb+query+tutorial&oq=mongodb+query+tutorial&aqs=chrome..69i57j0l2j69i60l3.4719j0j7&sourceid=chrome&ie=UTF-8 https://blog.exploratory.io/an-introduction-to-mongodb-query-for-beginners-bd463319aa4c https://docs.mongodb.com/manual/reference/method/db.collection.find/ https://docs.mongodb.com/manual/reference/method/db.collection.find/#find-projection https://stackoverflow.com/questions/39641925/mongodb-aggregation-framework-to-get-frequencies-of-fields-values https://exploratory.io/note/kanaugust/0961813761939766 https://docs.mongodb.com/manual/tutorial/project-fields-from-query-results/ https://docs.mongodb.com/manual/aggregation/ Mongo Studio 3T documentation: https://studio3t.com/download/ (also has uninstall information) https://studio3t.com/download-thank-you/?OS=x64 Google: MongoDB visualization MongoDB visualization map MongoDB Charts (Open 
source visualisation tools)
   json map visualizer
   geojson.tools

-------------------
Some queries with results:

# Num websites
db.getCollection('Websites').find({}).count()
1445

# Num webpages
db.getCollection('Webpages').find({}).count()
X75139
117496

# Find number of websites that have 1 or more pages in Maori (a positive numPagesInMRI)
db.getCollection('Websites').find({numPagesInMRI: { $gt: 0}}).count()
361

# Number of sites containing at least one sentence for which OpenNLP detected the best language = MRI
db.getCollection('Websites').find({numPagesContainingMRI: {$gt: 0}}).count()
868

# Obviously, the union of the above two will be identical to the numPagesContainingMRI count:
db.getCollection('Websites').find({ $or: [ { numPagesInMRI: { $gt: 0 } }, { numPagesContainingMRI: {$gt: 0} } ] } ).count()
868

# Find number of webpages that are deemed to be overall in MRI (pages where isMRI=true)
db.getCollection('Webpages').find({isMRI:true}).count()
X5224
X5215
db.getCollection('Webpages').find({isMRI:true}).count()
7818

# Number of pages that contain any number of MRI sentences
db.getCollection('Webpages').find({containsMRI: true}).count()
X12858
20371

# Number of sites with URLs containing /mi(/)
db.getCollection('Websites').find({urlContainsLangCodeInPath:true}).count()
153

# Number of websites that are outside NZ that contain /mi(/) in any of their sub-urls
db.getCollection('Websites').find({urlContainsLangCodeInPath:true, geoLocationCountryCode: {$ne : "NZ"} }).count()
147

# 6 sites with URLs containing /mi(/) that are in NZ (153 - 147)
db.getCollection('Websites').find({urlContainsLangCodeInPath:true, geoLocationCountryCode: "NZ"}).count()
6

# sort websites that contain /mi(/) in path by geoLocationCountryCode
# https://www.quackit.com/mongodb/tutorial/mongodb_sort_query_results.cfm
db.getCollection('Websites').find({urlContainsLangCodeInPath:true}).sort({geoLocationCountryCode: 1})

Actually, I want to sort by count.
See https://docs.mongodb.com/manual/reference/operator/aggregation/sortByCount/

# PROJECTION:
db.getCollection('Websites').find({geoLocationCountryCode: {$ne:"nz"}}, {geoLocationCountryCode:1, urlContainsLangCodeInPath: 1})

https://docs.mongodb.com/manual/aggregation/
EXAMPLE:
db.orders.aggregate([
    { $match: { status: "A" } },
    { $group: { _id: "$cust_id", total: { $sum: "$amount" } } }
])

X db.Websites.aggregate([{ $match:{urlContainsLangCodeInPath:true}}, $group: {geoLocationCountryCode:1, total: $count}])
X db.Websites.aggregate([ { $match:{urlContainsLangCodeInPath:true}}, {$group: {geoLocationCountryCode:1}} ])

WORKS (but an "unwind" will get rid of "null"):
db.Websites.aggregate([
    { $match: {urlContainsLangCodeInPath:true} },
    { $group: {_id: "$geoLocationCountryCode", count: {$sum: 1}} },
    { $sort : { count : -1} }
])

# COUNT OF ALL GEOLOCATION COUNTRIES
# https://stackoverflow.com/questions/14924495/mongodb-count-num-of-distinct-values-per-field-key

# LIST
db.Websites.distinct('geoLocationCountryCode');

# COUNT
db.Websites.distinct('geoLocationCountryCode').length;

# A COUNT WITH QUERY - https://docs.mongodb.com/manual/reference/command/distinct/#dbcmd.distinct
db.runCommand ( { distinct: "Websites", key: "geoLocationCountryCode", query: { "urlContainsLangCodeInPath": true} } );

# DISTINCT WITH QUERY WITHOUT COUNT - https://docs.mongodb.com/manual/reference/method/db.collection.distinct/
db.Websites.distinct('geoLocationCountryCode', {"urlContainsLangCodeInPath": true});

# SORTED - https://stackoverflow.com/questions/4759437/get-distinct-values-with-sorted-data
db.Websites.distinct('geoLocationCountryCode', {"urlContainsLangCodeInPath": true}).sort();

# count of all sites for which the geolocation is UNKNOWN
db.getCollection('Websites').find({geoLocationCountryCode: {$eq:"UNKNOWN"}}).count()

# AGGREGATION QUERIES THAT WORK:
# https://stackoverflow.com/questions/14924495/mongodb-count-num-of-distinct-values-per-field-key

WORKS:
// count of country codes for all sites
db.Websites.aggregate([
    { $unwind: "$geoLocationCountryCode" },
    { $group: { _id: "$geoLocationCountryCode", count: { $sum: 1 } } },
    { $sort : { count : -1} }
]);

// count of country codes for sites that have at least one page detected as MRI
db.Websites.aggregate([
    { $match: { numPagesInMRI: {$gt: 0} } },
    { $unwind: "$geoLocationCountryCode" },
    { $group: { _id: {$toLower: '$geoLocationCountryCode'}, count: { $sum: 1 } } },
    { $sort : { count : -1} }
]);

// count of country codes for sites that have at least one page containing at least one sentence detected as MRI
db.Websites.aggregate([
    { $match: { numPagesContainingMRI: {$gt: 0} } },
    { $unwind: "$geoLocationCountryCode" },
    { $group: { _id: {$toLower: '$geoLocationCountryCode'}, count: { $sum: 1 } } },
    { $sort : { count : -1} }
]);

WORKS:
// count of country codes for sites that have /mi(/) in path
db.Websites.aggregate([
    { $match: { urlContainsLangCodeInPath: true } },
    { $unwind: "$geoLocationCountryCode" },
    { $group: { _id: {$toLower: '$geoLocationCountryCode'}, count: { $sum: 1 } } },
    { $sort : { count : -1} }
]);

WORKS:
db.Websites.aggregate([
    { $match: { geoLocationCountryCode: {$ne : "UNKNOWN"} } },
    { $unwind: "$geoLocationCountryCode" },
    { $group: { _id: "$geoLocationCountryCode", count: { $sum: 1 } } },
    { $sort : { count : -1} }
]);

WORKS:
db.Websites.aggregate([
    { $match: { "urlContainsLangCodeInPath": true } },
    { $unwind: "$geoLocationCountryCode" },
    { $group: { _id: "$geoLocationCountryCode", count: { $sum: 1 } } },
    { $sort : { count : -1} }
]);

KEEP ADDITIONAL FIELDS - https://stackoverflow.com/questions/16662405/mongo-group-query-how-to-keep-fields:

a. KEEPS ONLY FIRST DOMAIN URL FOR EACH COUNTED COUNTRY CODE:
db.Websites.aggregate([
    { $match: { "urlContainsLangCodeInPath": true } },
    { $unwind: "$geoLocationCountryCode" },
    { $group: { _id: "$geoLocationCountryCode", count: { $sum: 1 }, domain: {$first: '$domain'} } },
    { $sort : { count : -1} }
]);

b. KEEP ALL DOMAIN URLS:
db.Websites.aggregate([
    { $match: { "urlContainsLangCodeInPath": true } },
    { $unwind: "$geoLocationCountryCode" },
    { $group: { _id: "$geoLocationCountryCode", count: { $sum: 1 }, domain: { $addToSet: '$domain' } } },
    { $sort : { count : -1} }
]);

# WANT TO GET THE ABOVE INTO WORLD MAP, use geojson.tools found by Dr Bainbridge
geojson.tools
USAGE: https://www.here.xyz/viewer-tool/

AIMS:
* Identify where Maori language is online.
* How can we identify high quality sites that would be good for a corpus?
  (Related work for other languages to quantifiably answer that)

data-preparation docs

------------------------------------------
BUILDING TOWARDS NEW MONGODB QUERY: Counts by country code of TENTATIVE NON-PRODUCT SITES that are in Maori
---
# https://stackoverflow.com/questions/16902930/mongodb-aggregation-framework-match-or
# https://docs.mongodb.com/manual/reference/operator/query/and/

# 1. all the websites which are from NZ:
db.getCollection('Websites').find({geoLocationCountryCode: "NZ"}).count()
128

# 2. all the websites that have /mi in URL path which are from NZ:
db.getCollection('Websites').find({$and: [{urlContainsLangCodeInPath: true}, {geoLocationCountryCode: "NZ"}]}).count()
6

# 3. all the websites that don't have /mi in URL path
db.getCollection('Websites').find({urlContainsLangCodeInPath: false}).count()
1292

# 4. all the websites that don't have /mi, or if they do are from NZ
# (should be the sum of points 2 and 3 above: 6 + 1292)
db.getCollection('Websites').find({$or: [{urlContainsLangCodeInPath: false}, {$and: [{urlContainsLangCodeInPath: true}, {geoLocationCountryCode: "NZ"}]}]}).count()
1298

# 5. All the websites that have at least 1 page containing a sentence detected as MRI AND either don't have /mi in URL path or if they do are from NZ
# These are the TENTATIVE NON-PRODUCT SITES
# Should be less than point 4, and no more than the 868 sites that have numPagesContainingMRI > 0
db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{$or: [{urlContainsLangCodeInPath: false}, {$and: [{urlContainsLangCodeInPath: true}, {geoLocationCountryCode: "NZ"}]}]}]}).count()
859

# 6. Now do the counts by country code of the above, by pasting the query of point 5 as the $match clause (i.e. without the .count() suffix)
# Counts by country code of TENTATIVE NON-PRODUCT SITES that are in Maori
db.Websites.aggregate([
    { $match: {$and: [{numPagesContainingMRI: {$gt: 0}},{$or: [{urlContainsLangCodeInPath: false}, {$and: [{urlContainsLangCodeInPath: true}, {geoLocationCountryCode: "NZ"}]}]}]} },
    { $unwind: "$geoLocationCountryCode" },
    { $group: { _id: {$toLower: '$geoLocationCountryCode'}, count: { $sum: 1 } } },
    { $sort : { count : -1} }
]);

The result is very close to the same aggregate on just numPagesContainingMRI.
That's because if you count those websites that contain /mi/ AND numPagesContainingMRI, they're very few:

db.Websites.aggregate([
    { $match: { $and: [{numPagesContainingMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}] } },
    { $unwind: "$geoLocationCountryCode" },
    { $group: { _id: {$toLower: '$geoLocationCountryCode'}, count: { $sum: 1 } } },
    { $sort : { count : -1} }
]);

_id   count
us    4.0
nz    4.0
au    3.0
ru    1.0
de    1.0

Total: 13 sites that have /mi/ and are detected as having MRI content:
db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}]}).count()
13

Of these 13, the 4 from NZ were already included in steps 5 and 6.
So the difference is only 9 sites (13 - 4) that have /mi/ in the path, which matches 868 - 859 = 9.

Let's get a listing of the sites' domains - 3 whose country codes are NOT NZ have NZ TLD!
/* 1 */
{
    "_id" : "nz",
    "count" : 4.0,
    "domain" : [
        "http://firstworldwar.tki.org.nz",
        "http://www.firstworldwar.tki.org.nz",
        "https://admin.teara.govt.nz",
        "http://community.nzdl.org"
    ]
}

/* 2 */
{
    "_id" : "us",
    "count" : 4.0,
    "domain" : [
        "https://sexualviolence.victimsinfo.govt.nz",
        "https://follow3rs.com",
        "http://www.church-of-christ.org",
        "http://www.mytrickstips.com"
    ]
}

/* 3 */
{
    "_id" : "au",
    "count" : 3.0,
    "domain" : [
        "https://rapuatearatika.education.govt.nz",
        "https://www.kiwiproperty.com",
        "https://curriculumtool.education.govt.nz"
    ]
}

/* 4 */
{
    "_id" : "ru",
    "count" : 1.0,
    "domain" : [
        "http://www.treningmozga.com"
    ]
}

/* 5 */
{
    "_id" : "de",
    "count" : 1.0,
    "domain" : [
        "http://www.almancax.com" # Website to learn German, autotranslated
    ]
}

But we're not catching a potentially large number of auto-translated sites, like
- https://www.gigalight.com/all-languages.html
- http://www.hzhinew.com/

https://culturesconnection.com/manual-or-automatic-translation/
   Manual Or Automatic Translation?
   Automatic translation continues to improve day by day. However, it is still unable to reach
   perfect levels of accuracy and lacks a natural feel. Will it ever replace human translation?

--------------
Mr Bill Rogers' suggestions for beginnings of trying to sieve out the auto-translated sites:
- skip .com and .co. domains. But .co.nz is also used for non-commercial sites or sites that nevertheless have high quality Maori language content.
- change cut-off value of OpenNLP language prediction? But for sentences and overlapping sentences, we're not using the cut-off value, we're just checking the best predicted language regardless of confidence level for this.
- Code for (a range of) loading of language options in auto-translated sites?
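
The first suggestion above (skip .com and .co. domains, but not .co.nz) could be tried out on a list of domain values before building it into a Mongo query. What follows is only a rough sketch in plain JavaScript: hostOf and hasCommercialSuffix are hypothetical helper names, the two-letter country-code pattern for .co.xx is an assumption, and http://example.co.nz and http://example.co.uk are made-up placeholders. The dots in the patterns are escaped so that they match a literal "." and not any character.

```javascript
// Hypothetical helper: extracts the host from a domain value such as
// "https://www.kiwiproperty.com".
function hostOf(domainUrl) {
  return domainUrl
    .replace(/^[a-z]+:\/\//i, "") // drop the protocol
    .split("/")[0]                // drop any path
    .split(":")[0]                // drop any port
    .toLowerCase();
}

// Hypothetical filter for the ".com / .co." suggestion.
// Assumption: ".co." means .co.<two-letter country code>, with .co.nz exempted
// because the notes say .co.nz sites can have high quality Maori content.
function hasCommercialSuffix(domainUrl) {
  var host = hostOf(domainUrl);
  if (/\.co\.nz$/.test(host)) return false;
  return /\.com$/.test(host) || /\.co\.[a-z]{2}$/.test(host);
}

var sample = [
  "https://www.kiwiproperty.com", // from the listing above
  "http://www.almancax.com",      // from the listing above
  "http://community.nzdl.org",    // from the listing above
  "http://example.co.nz"          // made-up placeholder
];

// Sites that survive the filter and would still need manual inspection.
var toInspect = sample.filter(function (d) {
  return !hasCommercialSuffix(d);
});
```

On this sample only the .org and .co.nz entries remain in toInspect. The filter is crude: it would also throw away any .com site with genuine Maori content, so it is at best a first pass.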
====================
# https://stackoverflow.com/questions/20175122/how-can-i-use-not-like-operator-in-mongodb

Info on the sites with Maori language content that are either from NZ or have .nz domain (TLD):

db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}}, {$or:[{geoLocationCountryCode: "NZ"}, {domain: /\.nz$/}]}]})

db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}}, {$or:[{geoLocationCountryCode: "NZ"}, {domain: /\.nz$/}]}]}).count()
183

Inverse: the sites detected as containing at least 1 Maori language sentence that are NOT from NZ NOR have .nz domain ending (TLD):

db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}}, {geoLocationCountryCode: {$ne: "NZ"}}, {domain: {$not: /\.nz$/}}]}).count()
685

The above two figures correctly add up to a total of 868 sites (183 + 685), which is the number of sites detected as containing at least 1 sentence in MRI:

db.getCollection('Websites').find({numPagesContainingMRI: {$gt: 0}}).count()
868

Without those with /mi in path:

db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}}, {geoLocationCountryCode: {$ne: "NZ"}}, {domain: {$not: /\.nz$/}}, {urlContainsLangCodeInPath: false}]}).count()

Now let's get a listing of all 685 sites to be manually inspected to determine whether they're auto-translated:

/*
db.Websites.aggregate([
    { $match: { $and: [{numPagesContainingMRI: {$gt: 0}}, {geoLocationCountryCode: {$ne: "NZ"}}, {domain: {$not: /\.nz$/}}, {urlContainsLangCodeInPath: false}] } },
    { $unwind: "$geoLocationCountryCode" },
    { $group: { _id: {$toLower: '$geoLocationCountryCode'}, count: { $sum: 1 }, domain: { $addToSet: '$domain' } } },
    { $sort : { count : -1} }
]);
*/

db.Websites.aggregate([
    { $match: { $and: [
        {numPagesContainingMRI: {$gt: 0}},
        {geoLocationCountryCode: {$ne: "NZ"}},
        {domain: {$not: /\.nz$/}},
        {urlContainsLangCodeInPath: {$ne: true}},
    ] } },
    { $unwind: "$geoLocationCountryCode" },
    { $group: { _id: {$toLower: '$geoLocationCountryCode'}, count: { $sum: 1 }, domain: { $addToSet: '$domain' } } },
    { $sort : { count : -1} }
]);

* CN: Only 1/113 sites from CN stood out as being of interest:
   http://kiwi2china.com/
   BUT: it's auto-translated (e.g. Dutch is clearly auto-translated), MRI not in default or any visible drop down list, and the domain changes once you view it in Dutch to https://nl.admission.nz/

* FR: 35 sites from FR
   http://blueheavenisland.com - French Polynesia
   https://www.lexilogos.com/ -> takes me to NZ website MaoriDictionary.co.nz etc for translating words anyway
   http://kihikihi.fr/ -> travel (blog?). Appears to be Hawaiian related and not Maori.
   !! http://chantsdeluttes.free.fr/versionsinter/page%20maori.html -> Seems it may be a proper translation or composition, as Dutch and Flemish (and Groningense) versions are different songs by individual translators/composers
   http://splaf.free.fr/pfurb.html - Tahiti, French Polynesian, ... island names
   http://mi.fitnessrebates.com - Uses https://wordpress.org/plugins/weglot/ wordpress-compatible multilingual plugin, which ensures translated pages get indexed by google - exactly what we want to avoid

*

--------------
GETTING TABLE DATA OUT OF MONGO DB:
https://stackoverflow.com/questions/28733692/how-to-export-json-from-mongodb-using-robomongo

   "export to file" as in a spreadsheet like to a .csv?
   IMO this is the EASIEST way to do this in Robo 3T (formerly robomongo):
   1. In the top right of the Robo 3T GUI there is a "View Results in text mode" button, click it and copy everything
   2. paste everything into this website: https://json-csv.com/
   3. click the download button and now you have it in a spreadsheet.

https://json-csv.com/
---------------------
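
As an alternative to pasting results into a website, the grouped results could be turned into CSV text with a few lines of JavaScript. The following is a sketch only: toCsv and csvField are hypothetical helpers, the column names countryCode, count and domain are chosen here for illustration, and the sample documents are two entries copied from the /mi/ listing earlier in these notes. It writes one row per domain URL.

```javascript
// Hypothetical helper: quotes a CSV field when it contains a comma, a double
// quote or a newline, doubling any embedded double quotes.
function csvField(value) {
  var s = String(value);
  return /[",\n]/.test(s) ? '"' + s.replace(/"/g, '""') + '"' : s;
}

// Hypothetical helper: converts aggregate results of the shape
//   { _id: <country code>, count: <n>, domain: [<url>, ...] }
// into CSV text with one row per domain.
function toCsv(docs) {
  var rows = ["countryCode,count,domain"]; // header names are an assumption
  docs.forEach(function (doc) {
    doc.domain.forEach(function (url) {
      rows.push([doc._id, doc.count, url].map(csvField).join(","));
    });
  });
  return rows.join("\n");
}

// Two result documents copied from the /mi/ listing in these notes.
var results = [
  { _id: "ru", count: 1, domain: ["http://www.treningmozga.com"] },
  { _id: "de", count: 1, domain: ["http://www.almancax.com"] }
];
var csv = toCsv(results);
```

In the mongo shell it should be possible to feed this from a pipeline by calling .toArray() on the cursor that db.Websites.aggregate([...]) returns, though that has not been tried here.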