MongoDB Installation:

CENTOS (Analytics):

FROM SOURCE:

GUI:
Robomongo is Robo 3T now

JAR FILE:

    sudo apt-get install mongodb-clients
    mongo 'mongodb://' -u anupama -p

Failed with
    Error: HostAndPort: host is empty at src/mongo/shell/mongo.js:148
    exception: connect failed

This is due to a version incompatibility between Client and mongodb Server.
The solution is to follow instructions at and then as below:

    sudo apt-get purge mongodb-clients
    sudo apt-key adv --keyserver hkp:// --recv 9DA31620334BD75D9DCB49F368818C72E52529D4
    echo "deb [ arch=amd64,arm64 ] xenial/mongodb-org/4.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.0.list
    sudo apt-get update
    sudo apt-get install mongodb-clients
    mongo 'mongodb://' -u anupama -p
    (still doesn't work)
    sudo apt-get install -y mongodb-org

The above ensures an up to date mongo client but installs the mongodb server too.
Maybe this is the only step that is needed to install up-to-date mongo client and mongodb server?

    sudo service mongod status
    sudo service mongod start

"mongod" stands for mongo-daemon. This runs the mongo db server listening for client connections

    sudo service mongod status
    sudo service mongod stop

DETAILS:

wharariki:[879]/Scratch/ak19/gs3-extensions/maori-lang-detection>mongo 'mongodb://' -u anupama -p
didn't work with the pwd.
Failed with:

MongoDB shell version: 2.6.10
Enter password:
connecting to: mongodb://
2019-11-04T20:02:47.970+1300 Assertion: 13110:HostAndPort: host is empty
2019-11-04T20:02:47.970+1300 0x6b75c9 0x659e9f 0x636f69 0x4fa55c 0x501249 0x4fa7f1 0x6006fd 0x5eb869 0x7f7bfbd47d76 0x1f3c10d06362
 mongo(_ZN5mongo15printStackTraceERSo+0x39) [0x6b75c9]
 mongo(_ZN5mongo10logContextEPKc+0x21f) [0x659e9f]
 mongo(_ZN5mongo11msgassertedEiPKc+0xd9) [0x636f69]
 mongo(_ZN5mongo16ConnectionString12_fillServersENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x50c) [0x4fa55c]
 mongo(_ZN5mongo16ConnectionStringC1ENS0_14ConnectionTypeERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES9_+0x99) [0x501249]
 mongo(_ZN5mongo16ConnectionString5parseERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERS6_+0x201) [0x4fa7f1]
 mongo(_ZN5mongo17mongoConsExternalEPNS_7V8ScopeERKN2v89ArgumentsE+0x11d) [0x6006fd]
 mongo(_ZN5mongo7V8Scope10v8CallbackERKN2v89ArgumentsE+0xa9) [0x5eb869]
 /usr/lib/ [0x7f7bfbd47d76]
 [0x1f3c10d06362]
2019-11-04T20:02:47.971+1300 Error: HostAndPort: host is empty at src/mongo/shell/mongo.js:148
exception: connect failed

This is due to a version incompatibility between Client and mongodb Server.
Can find client version above. (2.6.10)
Server version can be found by running the mongo client shell.
Doing so without loading a db:

wharariki:[880]/Scratch/ak19/gs3-extensions/maori-lang-detection>mongo --shell -nodb
MongoDB shell version: 2.6.10      <<<<<<<<<-------------------<<<< MONGO CLIENT VERSION
type "help" for help
> help
    help on db methods
    help on collection methods
    sharding helpers
    replica set helpers
    help admin          administrative help
    help connect        connecting to a db help
    help keys           key shortcuts
    help misc           misc things to know
    help mr             mapreduce

    show dbs            show database names
    show collections    show collections in current database
    show users          show users in current database
    show profile        show most recent system.profile entries with time >= 1ms
    show logs           show the accessible logger names
    show log [name]     prints out the last segment of log in memory, 'global' is default
    use                 set current database
    list objects in collection foo
    { a : 1 } )         list objects in foo where a == 1
    it                  result of the last line evaluated; use to further iterate
    DBQuery.shellBatchSize = x   set default number of items to display on shell
    exit                quit the mongo shell

> help connect

Normally one specifies the server on the mongo shell command line. Run mongo --help to see those options.
Additional connections may be opened:

    var x = new Mongo('host[:port]');
    var mydb = x.getDB('mydb');
  or
    var mydb = connect('host[:port]/mydb');

Note: the REPL prompt only auto-reports getLastError() for the shell command line connection.

Getting help on connect options:

> var x = new Mongo('');
> var mydb = x.getDB('anupama');
>
DBCollection help
    db.connect.find().help() - show DBCursor help
    db.connect.count()
    db.connect.copyTo(newColl) - duplicates collection by copying all documents to newColl; no indexes are copied.
    db.connect.convertToCapped(maxBytes) - calls {convertToCapped:'connect', size:maxBytes}} command
    db.connect.dataSize()
    db.connect.distinct( key ) - e.g. db.connect.distinct( 'x' )
    db.connect.drop() drop the collection
    db.connect.dropIndex(index) - e.g. db.connect.dropIndex( "indexName" ) or db.connect.dropIndex( { "indexKey" : 1 } )
    db.connect.dropIndexes()
    db.connect.ensureIndex(keypattern[,options]) - options is an object with these possible fields: name, unique, dropDups
    db.connect.reIndex()
    db.connect.find([query],[fields]) - query is an optional query filter. fields is optional set of fields to return.
                                        e.g. db.connect.find( {x:77} , {name:1, x:1} )
    db.connect.find(...).count()
    db.connect.find(...).limit(n)
    db.connect.find(...).skip(n)
    db.connect.find(...).sort(...)
    db.connect.findOne([query])
    db.connect.findAndModify( { update : ... , remove : bool [, query: {}, sort: {}, 'new': false] } )
    db.connect.getDB() get DB object associated with collection
    db.connect.getPlanCache() get query plan cache associated with collection
    db.connect.getIndexes()
    { key : ..., initial: ..., reduce : ...[, cond: ...] } )
    db.connect.insert(obj)
    db.connect.mapReduce( mapFunction , reduceFunction , )
    db.connect.aggregate( [pipeline], ) - performs an aggregation on a collection; returns a cursor
    db.connect.remove(query)
    db.connect.renameCollection( newName , ) renames the collection.
    db.connect.runCommand( name , ) runs a db command with the given name where the first param is the collection name
    db.connect.stats()
    db.connect.storageSize() - includes free space allocated to this collection
    db.connect.totalIndexSize() - size in bytes of all the indexes
    db.connect.totalSize() - storage allocated for all data and indexes
    db.connect.update(query, object[, upsert_bool, multi_bool]) - instead of two flags, you can pass an object with fields: upsert, multi
    db.connect.validate( ) - SLOW
    db.connect.getShardVersion() - only for use with sharding
    db.connect.getShardDistribution() - prints statistics about data distribution in the cluster
    db.connect.getSplitKeysForChunks( ) - calculates split points over all chunks and returns splitter function
    db.connect.getWriteConcern() - returns the write concern used for any operations on this collection, inherited from server/db if set
    db.connect.setWriteConcern( ) - sets the write concern for writes to the collection
    db.connect.unsetWriteConcern( ) - unsets the write concern for writes to the collection

> mydb.version()
4.0.13      <<<<<<<<<-------------------<<<< MONGODB SERVER VERSION

Check Mongo server version:

Finally we now know the mongodb server version: 4.0.13
This version doesn't work with our mongo client (shell) version of 2.6.10.

1. Download "Double Pack" from
2. Untar its contents. Then untar the tarball in that.
3. Run:
   wharariki:[110]~/Downloads/robo3t-1.3.1-linux-x86_64-7419c406>./bin/robo3t

===================

On analytics, vagrant node1, we've installed the mongodb server and client.
We're able to successfully create collections on here.
vagrant@node1:~$ mongo
MongoDB shell version v4.2.1
connecting to: mongodb://
Implicit session: session { "id" : UUID("87bb585c-4685-47f6-bf89-a93801daeb2d") }
MongoDB server version: 4.2.1
Server has startup warnings:
2019-11-04T07:48:14.197+0000 I STORAGE [initandlisten]
2019-11-04T07:48:14.198+0000 I STORAGE [initandlisten] ** WARNING: Using the XFS filesystem is strongly recommended with the WiredTiger storage engine
2019-11-04T07:48:14.198+0000 I STORAGE [initandlisten] ** See
2019-11-04T07:48:14.624+0000 I CONTROL [initandlisten]
2019-11-04T07:48:14.624+0000 I CONTROL [initandlisten] ** WARNING: Access control is not enabled for the database.
2019-11-04T07:48:14.624+0000 I CONTROL [initandlisten] ** Read and write access to data and configuration is unrestricted.
2019-11-04T07:48:14.624+0000 I CONTROL [initandlisten]
---
Enable MongoDB's free cloud-based monitoring service, which will then receive and display
metrics about your deployment (disk utilization, CPU, operation statistics, etc).

The monitoring data will be available on a MongoDB website with a unique URL accessible to you
and anyone you share the URL with. MongoDB may use this information to make product
improvements and to suggest MongoDB products and deployment options to you.

To enable free monitoring, run the following command: db.enableFreeMonitoring()
To permanently disable this reminder, run the following command: db.disableFreeMonitoring()
---

> show dbs
admin   0.000GB
config  0.000GB
local   0.000GB
> use db ateacrawldata
2019-11-05T05:24:20.155+0000 E QUERY [js] Error: [db ateacrawldata] is not a valid database name :
Mongo.prototype.getDB@src/mongo/shell/mongo.js:51:12
getDatabase@src/mongo/shell/session.js:913:28
DB.prototype.getSiblingDB@src/mongo/shell/db.js:22:12
shellHelper.use@src/mongo/shell/utils.js:803:10
shellHelper@src/mongo/shell/utils.js:790:15
@(shellhelp2):1:1
> db.createCollection('webpages');
{ "ok" : 1 }
> db.webpages.drop();
...
^C
> db.webpages.drop();
true
> use ateacrawldata
switched to db ateacrawldata
> db.createCollection('webpages');
{ "ok" : 1 }
> show collections
webpages
> db.createCollection('websites');
{ "ok" : 1 }
>

------------------------

Ask Clint to rename "anupama" database to "ateacrawldata" database following the instructions at:

I don't have permissions to do this.
Nor do I have permissions to create Mongo collections within a new database that I create, like ateacrawldata.
I only seem to have rights to the "anupama" database.

-----------------------

Vagrant virtual machine Node1 has the mongodb installed.
After doing "vagrant up" on node1 to start node1:

[anupama@analytics vagrant-hadoop-hive-spark]$ vagrant ssh
vagrant@node1:~$ mongo
MongoDB shell version v4.2.1
connecting to: mongodb://
2019-11-13T09:22:46.996+0000 E QUERY [js] Error: couldn't connect to server, connection attempt failed: SocketException: Error connecting to :: caused by :: Connection refused :
connect@src/mongo/shell/mongo.js:341:17
@(connect):2:6
2019-11-13T09:22:46.999+0000 F - [main] exception: connect failed
2019-11-13T09:22:46.999+0000 E - [main] exiting with code 1

vagrant@node1:~$ sudo service mongod status
● mongod.service - MongoDB Database Server
   Loaded: loaded (/lib/systemd/system/mongod.service; disabled; vendor preset: enabled)
   Active: inactive (dead)
     Docs:

vagrant@node1:~$ sudo service mongod start
vagrant@node1:~$ sudo service mongod status
● mongod.service - MongoDB Database Server
   Loaded: loaded (/lib/systemd/system/mongod.service; disabled; vendor preset: enabled)
   Active: active (running) since Wed 2019-11-13 09:24:07 UTC; 2s ago
     Docs:
 Main PID: 4383 (mongod)
    Tasks: 32
   Memory: 199.3M
      CPU: 754ms
   CGroup: /system.slice/mongod.service
           └─4383 /usr/bin/mongod --config /etc/mongod.conf

Nov 13 09:24:07 node1 systemd[1]: Started MongoDB Database Server.
vagrant@node1:~$

So now mongodb is running on node1 on localhost:27017.
Next, in another x-term connected to analytics' node1 Vagrant VM, port forward node1's localhost:27017 to analytics' localhost:27017:

    vagrant ssh -- -L 27017:localhost:27017

Finally, in another x-term, port-forward from analytics:27017 to current machine's 27017:

    ssh -L 27017:localhost:27017 analytics

Now can connect Robo-3T running on current machine to localhost:27017.

Then in a new x-term, can use the client mongo shell to connect (by default to localhost:27017):

wharariki:[122]/Scratch/ak19/GS309>mongo --shell
MongoDB shell version v4.0.13
connecting to: mongodb://
...
> show dbs
admin          0.000GB
ateacrawldata  1.532GB
config         0.000GB
local          0.000GB
> use ateacrawldata
> show collections
Webpages
Websites
oldwebpages
oldwebsites

-------------------

Country code to geolocation CSV file found by Dr Bainbridge:

Import into mongodb with:

NOTE: mongoimport is a commandline utility and not a command to be run from the mongo shell. See
This means, in an x-term, DON'T RUN MONGO SHELL/client first.
Instead, directly from x-term, run the following to import the countrycodes.csv file:

    mongoimport -d ateacrawldata -c countrylocations --type csv --file /Scratch/ak19/maori-lang-detection/MoreReading/countrycodes.csv --headerline

-------------------------

MONGODB QUERIES:

db.getCollection('webpages').find({"isMRI": true, "singleSentences.langCode": "mri"})

db.getCollection('webpages').find({"singleSentences": { $elemMatch: {"langCode":"mri"} } }, {"singleSentences.$": "mri"})

[single English lang sentence]
db.getCollection('Webpages').find({"isMRI": true, "singleSentences": { $elemMatch: {"langCode":"eng"} } }, {"singleSentences.$": "eng"})

[gets 1st sentence of docs which have sentences containing MRI]
db.getCollection('Webpages').find({"containsMRI": true, "singleSentences": { $elemMatch: {"langCode":"mri"} } }, {"singleSentences.$": "mri"})

READING
    mongodb java convert class
    X
    X
    X
    => mongodb querying
    Mongo Studio 3T documentation: (also has uninstall information)
    Google:
        MongoDB visualization
        MongoDB visualization map
    MongoDB Charts
    (Open source visualisation tools)
    json map visualizer

-------------------

Some queries with results:

# Num websites
db.getCollection('Websites').find({}).count()
1446

# Num webpages
db.getCollection('Webpages').find({}).count()
X75139
117496

# Find number of websites that have 1 or more pages in Maori (a positive numPagesInMRI)
db.getCollection('Websites').find({numPagesInMRI: { $gt: 0}}).count()
361

# Find number of webpages that are deemed to be overall in MRI (pages where isMRI=true)
db.getCollection('Webpages').find({isMRI:true}).count()
X5224
X5215
db.getCollection('Webpages').find({isMRI:true}).count()
7818

# Number of pages that contain any number of MRI sentences
db.getCollection('Webpages').find({containsMRI: true}).count()
X12858
20371

# Number of sites with URLs containing /mi(/)
db.getCollection('Websites').find({urlContainsLangCodeInpath:true}).count()
153

# Number of websites that are outside NZ that contain /mi(/) in any of their sub-urls
db.getCollection('Websites').find({urlContainsLangCodeInpath:true, geoLocationCountryCode: {$ne : "NZ"} }).count()
148

# 5 sites with URLs containing /mi(/) that are in NZ
db.getCollection('Websites').find({urlContainsLangCodeInpath:true, geoLocationCountryCode: "NZ"}).count()
5

# sort websites that contain /mi(/) in path by geoLocationCountryCode
# db.getCollection('Websites').find({urlContainsLangCodeInpath:true}).sort({geoLocationCountryCode: 1})
Actually, I want to sort by count. See

# PROJECTION:
db.getCollection('Websites').find({geoLocationCountryCode: {$ne:"nz"}}, {geoLocationCountryCode:1, urlContainsLangCodeInpath: 1})

EXAMPLE:
db.orders.aggregate([
    { $match: { status: "A" } },
    { $group: { _id: "$cust_id", total: { $sum: "$amount" } } }
])

X db.Websites.aggregate([{ $match:{urlContainsLangCodeInPath:true}}, $group: {geoLocationCountryCode:1, total: $count}])

X db.Websites.aggregate([
    { $match:{urlContainsLangCodeInPath:true}},
    {$group: {geoLocationCountryCode:1}}
])

WORKS (but an "unwind" will get rid of "null"):
db.Websites.aggregate([
    { $match:{urlContainsLangCodeInPath:true}},
    {$group: {_id: "$geoLocationCountryCode", count: {$sum: 1}}},
    { $sort : { count : -1} }
])

# COUNT OF ALL GEOLOCATION COUNTRIES
#
# LIST
db.Websites.distinct('geoLocationCountryCode');
# COUNT
db.Websites.distinct('geoLocationCountryCode').length;

# A COUNT WITH QUERY -
db.runCommand ( { distinct: "Websites", key: "geoLocationCountryCode", query: { "urlContainsLangCodeInPath": true} } );

# DISTINCT WITH QUERY WITHOUT COUNT -
db.Websites.distinct('geoLocationCountryCode', {"urlContainsLangCodeInPath": true});

# SORTED -
db.Websites.distinct('geoLocationCountryCode', {"urlContainsLangCodeInPath": true}).sort();

# count of all sites for which the geolocation is UNKNOWN
db.getCollection('Websites').find({geoLocationCountryCode: {$eq:"UNKNOWN"}}).count()

# AGGREGATION QUERIES THAT WORK:

# WORKS:
// count of country codes for all sites
db.Websites.aggregate([
    { $unwind: "$geoLocationCountryCode" },
    { $group: { _id: "$geoLocationCountryCode", count: { $sum: 1 } } },
    { $sort : { count : -1} }
]);

WORKS:
// count of country codes for sites that have /mi(/) in path
db.Websites.aggregate([
    { $match: { urlContainsLangCodeInPath: true } },
    { $unwind: "$geoLocationCountryCode" },
    { $group: { _id: {$toLower: '$geoLocationCountryCode'}, count: { $sum: 1 } } },
    { $sort : { count : -1} }
]);

WORKS:
db.Websites.aggregate([
    { $match: { geoLocationCountryCode: {$ne : "UNKNOWN"} } },
    { $unwind: "$geoLocationCountryCode" },
    { $group: { _id: "$geoLocationCountryCode", count: { $sum: 1 } } },
    { $sort : { count : -1} }
]);

WORKS:
db.Websites.aggregate([
    { $match: { "urlContainsLangCodeInPath": true } },
    { $unwind: "$geoLocationCountryCode" },
    { $group: { _id: "$geoLocationCountryCode", count: { $sum: 1 } } },
    { $sort : { count : -1} }
]);

KEEP ADDITIONAL FIELDS -

a. KEEPS ONLY FIRST DOMAIN URL FOR EACH COUNTED COUNTRY CODE:
db.Websites.aggregate([
    { $match: { "urlContainsLangCodeInPath": true } },
    { $unwind: "$geoLocationCountryCode" },
    { $group: { _id: "$geoLocationCountryCode", count: { $sum: 1 }, domain: {$first: '$domain'} } },
    { $sort : { count : -1} }
]);

b. KEEP ALL DOMAIN URLS:
db.Websites.aggregate([
    { $match: { "urlContainsLangCodeInPath": true } },
    { $unwind: "$geoLocationCountryCode" },
    { $group: { _id: "$geoLocationCountryCode", count: { $sum: 1 }, domain: { $addToSet: '$domain' } } },
    { $sort : { count : -1} }
]);

# WANT TO GET THE ABOVE INTO WORLD MAP, use found by Dr Bainbridge

USAGE:

AIMS:
* Identify where Maori language is online.
* How can we identify high quality sites that would be good for a corpus.
  (Related work for other languages to quantifiably answer that)

data-preparation docs

/* 1 */ { "_id" : "US", "count" : 93.0, -95.8, 40.33 }
/* 2 */ { "_id" : "AU", "count" : 7.0, 135.8, -25.33 }
/* 3 */ { "_id" : "CN", "count" : 7.0, 100.8, 32.33 }
/* 4 */ { "_id" : "NZ", "count" : 5.0, 175.8, -40.33 }
/* 5 */ { "_id" : "DE", "count" : 5.0, 10.8, 50.33 }
/* 6 */ { "_id" : "HK", "count" : 5.0, 114, 22.33 }
/* 7 */ { "_id" : "RU", "count" : 4.0, 38.4, 55.5 }
/* 8 */ { "_id" : "JP", "count" : 3.0, 137.8, 36 }
/* 9 */ { "_id" : "GB", "count" : 3.0, -2, 53.33 }
/* 10 */ { "_id" : "CA", "count" : 2.0, -105.8, 55.33 }
/* 11 */ { "_id" : "FR", "count" : 2.0, 3, 47.33 }
/* 12 */ { "_id" : "DK", "count" : 2.0, 9.5, 55.33 }
/* 13 British Virgin Islands */ { "_id" : "VG", "count" : 2.0, -64.8, 18.35 }
/* 14 Ukraine */ { "_id" : "UA", "count" : 1.0, 31.5, 48.5 }
/* 15 */ { "_id" : "CZ", "count" : 1.0, 16.2, 49.7 }
/* 16 Switzerland */ { "_id" : "CH", "count" : 1.0, 8.5, 47 }
/* 17 South Africa */ { "_id" : "ZA", "count" : 1.0, 24.2, -30.7 }
/* 18 */ { "_id" : "NL", "count" : 1.0, 5.8, 52.33 }
/* 19 */ { "_id" : "KR", "count" : 1.0, 127.8, 36.8 }

/**
{
  "type": "MultiPoint",
  "coordinates": [
    [ -95.8, 40.33 ], [ 135.8, -25.33 ], [ 100.8, 32.33 ], [ 175.8, -40.33 ],
    [ 10.8, 50.33 ], [ 10.8, 50.33 ], [ 114, 22.33 ], [ 38.4, 55.5 ],
    [ -2, 53.33 ], [ 137.8, 36 ], [ -105.8, 55.33 ], [ 3, 47.33 ],
    [ 9.5, 55.33 ], [ -64.8, 18.35 ], [ 31.5, 48.5 ], [ 16.2, 49.7 ],
    [ 8.5, 47 ], [ 24.2, -30.7 ], [ 5.8, 52.33 ], [ 127.8, 36.8 ]
  ]
}
*/
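
A minimal sketch of how the MultiPoint above could be generated rather than typed by hand. It assumes each aggregation result row has had its longitude and latitude attached as `lon` and `lat` fields; those field names are hypothetical, since the rows above only have the two numbers appended. The function names `toMultiPoint` and `toFeatureCollection` are also made up for illustration. GeoJSON positions are [longitude, latitude], which matches the order used in the rows above. Only three sample rows are included here.

```javascript
// Sample rows in the assumed shape: country code, count, and hand-added
// lon/lat (hypothetical field names, not produced by the aggregation itself).
const rows = [
  { _id: "US", count: 93, lon: -95.8, lat: 40.33 },
  { _id: "AU", count: 7, lon: 135.8, lat: -25.33 },
  { _id: "NZ", count: 5, lon: 175.8, lat: -40.33 },
];

// Build a GeoJSON MultiPoint: one [lon, lat] position per row, in row order.
function toMultiPoint(rows) {
  return {
    type: "MultiPoint",
    coordinates: rows.map((r) => [r.lon, r.lat]),
  };
}

// Build a GeoJSON FeatureCollection instead, so that each point keeps
// its country code and site count as properties.
function toFeatureCollection(rows) {
  return {
    type: "FeatureCollection",
    features: rows.map((r) => ({
      type: "Feature",
      geometry: { type: "Point", coordinates: [r.lon, r.lat] },
      properties: { countryCode: r._id, count: r.count },
    })),
  };
}

// Print both forms as JSON text, ready to paste into a map viewer.
console.log(JSON.stringify(toMultiPoint(rows), null, 2));
console.log(JSON.stringify(toFeatureCollection(rows), null, 2));
```

The FeatureCollection form may be the more useful one for a world map, because a plain MultiPoint loses which point belongs to which country and how many sites it stands for.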