source: other-projects/maori-lang-detection/MoreReading/mongodb.txt@ 33806

Last change on this file since 33806 was 33806, checked in by ak19, 4 years ago

More mongodb querying revealed that when tentative product sites (a site has /mi in its URL path and is hosted outside NZ) are excluded from the sites with numPagesContainingMRI > 0, the result is barely different from just querying numPagesContainingMRI > 0. Sadly, a brief check of the domains in both result sets still turned up several auto-translated sites. So the test excluding tentative product sites should perhaps be repeated with numPagesInMRI > 0, to see whether that better discriminates between auto-translated sites and sites with proper Maori language webpages.

1MongoDB
2Installation:
3 https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/
4 https://docs.mongodb.com/manual/administration/install-on-linux/
5 https://hevodata.com/blog/install-mongodb-on-ubuntu/
6 https://www.digitalocean.com/community/tutorials/how-to-install-mongodb-on-ubuntu-16-04
7 CENTOS (Analytics): https://tecadmin.net/install-mongodb-on-centos/
8 FROM SOURCE: https://github.com/mongodb/mongo/wiki/Build-Mongodb-From-Source
9GUI:
10 https://robomongo.org/
11 Robomongo is Robo 3T now
12
13https://www.tutorialspoint.com/mongodb/mongodb_java.htm
14JAR FILE:
15 http://central.maven.org/maven2/org/mongodb/mongo-java-driver/
16 https://mongodb.github.io/mongo-java-driver/
17
18
19
20https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/
21http://www.programmersought.com/article/6500308940/
22
    sudo apt-get install mongodb-clients
    mongo 'mongodb://mongodb.cms.waikato.ac.nz:27017' -u anupama -p

Failed with
    Error: HostAndPort: host is empty at src/mongo/shell/mongo.js:148
    exception: connect failed

This is due to a version incompatibility between the mongo client (shell) and the mongodb server.
The solution is to follow the instructions at http://www.programmersought.com/article/6500308940/
and then https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/
as below:

    sudo apt-get purge mongodb-clients
    sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 9DA31620334BD75D9DCB49F368818C72E52529D4
    echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu xenial/mongodb-org/4.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.0.list
    sudo apt-get update
    sudo apt-get install mongodb-clients
    mongo 'mongodb://mongodb.cms.waikato.ac.nz:27017' -u anupama -p
(still doesn't work)
    sudo apt-get install -y mongodb-org
The above ensures an up-to-date mongo client, but installs the mongodb server too. Maybe this is the only step needed to install an up-to-date mongo client and mongodb server?
    sudo service mongod status

    sudo service mongod start
"mongod" stands for mongo daemon. It runs the mongodb server, which listens for client connections.
    sudo service mongod status
    sudo service mongod stop
50
51
52DETAILS:
53
54wharariki:[879]/Scratch/ak19/gs3-extensions/maori-lang-detection>mongo 'mongodb://mongodb.cms.waikato.ac.nz:27017' -u anupama -p
55
56didn't work with the pwd. Failed with:
57
58 MongoDB shell version: 2.6.10
59 Enter password:
60 connecting to: mongodb://mongodb.cms.waikato.ac.nz:27017
61 2019-11-04T20:02:47.970+1300 Assertion: 13110:HostAndPort: host is empty
62 2019-11-04T20:02:47.970+1300 0x6b75c9 0x659e9f 0x636f69 0x4fa55c 0x501249 0x4fa7f1 0x6006fd 0x5eb869 0x7f7bfbd47d76 0x1f3c10d06362
63 mongo(_ZN5mongo15printStackTraceERSo+0x39) [0x6b75c9]
64 mongo(_ZN5mongo10logContextEPKc+0x21f) [0x659e9f]
65 mongo(_ZN5mongo11msgassertedEiPKc+0xd9) [0x636f69]
66 mongo(_ZN5mongo16ConnectionString12_fillServersENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x50c) [0x4fa55c]
67 mongo(_ZN5mongo16ConnectionStringC1ENS0_14ConnectionTypeERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES9_+0x99) [0x501249]
68 mongo(_ZN5mongo16ConnectionString5parseERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERS6_+0x201) [0x4fa7f1]
69 mongo(_ZN5mongo17mongoConsExternalEPNS_7V8ScopeERKN2v89ArgumentsE+0x11d) [0x6006fd]
70 mongo(_ZN5mongo7V8Scope10v8CallbackERKN2v89ArgumentsE+0xa9) [0x5eb869]
71 /usr/lib/libv8.so.3.14.5(+0x99d76) [0x7f7bfbd47d76]
72 [0x1f3c10d06362]
73 2019-11-04T20:02:47.971+1300 Error: HostAndPort: host is empty at src/mongo/shell/mongo.js:148
74 exception: connect failed
75
76
This is due to a version incompatibility between the mongo client and the mongodb server.
The client version appears in the output above (2.6.10).
The server version can be found from the mongo client shell, by starting it without connecting to a db and then opening the connection manually:
80
81
82 wharariki:[880]/Scratch/ak19/gs3-extensions/maori-lang-detection>mongo --shell -nodb
83 MongoDB shell version: 2.6.10 <<<<<<<<<-------------------<<<< MONGO CLIENT VERSION
84 type "help" for help
85 > help
86 db.help() help on db methods
87 db.mycoll.help() help on collection methods
88 sh.help() sharding helpers
89 rs.help() replica set helpers
90 help admin administrative help
91 help connect connecting to a db help
92 help keys key shortcuts
93 help misc misc things to know
94 help mr mapreduce
95
96 show dbs show database names
97 show collections show collections in current database
98 show users show users in current database
99 show profile show most recent system.profile entries with time >= 1ms
100 show logs show the accessible logger names
101 show log [name] prints out the last segment of log in memory, 'global' is default
102 use <db_name> set current database
103 db.foo.find() list objects in collection foo
104 db.foo.find( { a : 1 } ) list objects in foo where a == 1
105 it result of the last line evaluated; use to further iterate
106 DBQuery.shellBatchSize = x set default number of items to display on shell
107 exit quit the mongo shell
108
109 > help connect
110
111 Normally one specifies the server on the mongo shell command line. Run mongo --help to see those options.
112 Additional connections may be opened:
113
114 var x = new Mongo('host[:port]');
115 var mydb = x.getDB('mydb');
116 or
117 var mydb = connect('host[:port]/mydb');
118
119 Note: the REPL prompt only auto-reports getLastError() for the shell command line connection.
120
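 Connecting manually as described in the help above. Note that mydb.connect.help() prints the generic DBCollection help for a collection named "connect", not help on connection options: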
122
123 > var x = new Mongo('mongodb.cms.waikato.ac.nz:27017');
124 > var mydb = x.getDB('anupama');
125
126 > mydb.connect.help()
127 DBCollection help
128 db.connect.find().help() - show DBCursor help
129 db.connect.count()
130 db.connect.copyTo(newColl) - duplicates collection by copying all documents to newColl; no indexes are copied.
131 db.connect.convertToCapped(maxBytes) - calls {convertToCapped:'connect', size:maxBytes}} command
132 db.connect.dataSize()
133 db.connect.distinct( key ) - e.g. db.connect.distinct( 'x' )
134 db.connect.drop() drop the collection
135 db.connect.dropIndex(index) - e.g. db.connect.dropIndex( "indexName" ) or db.connect.dropIndex( { "indexKey" : 1 } )
136 db.connect.dropIndexes()
137 db.connect.ensureIndex(keypattern[,options]) - options is an object with these possible fields: name, unique, dropDups
138 db.connect.reIndex()
139 db.connect.find([query],[fields]) - query is an optional query filter. fields is optional set of fields to return.
140 e.g. db.connect.find( {x:77} , {name:1, x:1} )
141 db.connect.find(...).count()
142 db.connect.find(...).limit(n)
143 db.connect.find(...).skip(n)
144 db.connect.find(...).sort(...)
145 db.connect.findOne([query])
146 db.connect.findAndModify( { update : ... , remove : bool [, query: {}, sort: {}, 'new': false] } )
147 db.connect.getDB() get DB object associated with collection
148 db.connect.getPlanCache() get query plan cache associated with collection
149 db.connect.getIndexes()
150 db.connect.group( { key : ..., initial: ..., reduce : ...[, cond: ...] } )
151 db.connect.insert(obj)
152 db.connect.mapReduce( mapFunction , reduceFunction , <optional params> )
153 db.connect.aggregate( [pipeline], <optional params> ) - performs an aggregation on a collection; returns a cursor
154 db.connect.remove(query)
155 db.connect.renameCollection( newName , <dropTarget> ) renames the collection.
156 db.connect.runCommand( name , <options> ) runs a db command with the given name where the first param is the collection name
157 db.connect.save(obj)
158 db.connect.stats()
159 db.connect.storageSize() - includes free space allocated to this collection
160 db.connect.totalIndexSize() - size in bytes of all the indexes
161 db.connect.totalSize() - storage allocated for all data and indexes
162 db.connect.update(query, object[, upsert_bool, multi_bool]) - instead of two flags, you can pass an object with fields: upsert, multi
163 db.connect.validate( <full> ) - SLOW
164 db.connect.getShardVersion() - only for use with sharding
165 db.connect.getShardDistribution() - prints statistics about data distribution in the cluster
166 db.connect.getSplitKeysForChunks( <maxChunkSize> ) - calculates split points over all chunks and returns splitter function
167 db.connect.getWriteConcern() - returns the write concern used for any operations on this collection, inherited from server/db if set
168 db.connect.setWriteConcern( <write concern doc> ) - sets the write concern for writes to the collection
169 db.connect.unsetWriteConcern( <write concern doc> ) - unsets the write concern for writes to the collection
170 > mydb.version()
171 4.0.13 <<<<<<<<<-------------------<<<< MONGODB SERVER VERSION
172
173(Check Mongo server version: https://stackoverflow.com/questions/38160412/how-to-find-the-exact-version-of-installed-mongodb)
174
175Finally we now know the mongodb server version: 4.0.13
176This version doesn't work with our mongo client (shell) version of 2.6.10.
177
178
179DETAILS OF INSTALLING MONGO-CLIENT AND UPDATING IT, AND INSTALLING MONGODB SERVER:
180
181
182 54 sudo apt-get purge mongodb-clients
183 55 sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 9DA31620334BD75D9DCB49F368818C72E52529D4
184 56 echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu xenial/mongodb-org/4.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.0.list
185 57 sudo apt-get update
186 58 sudo apt-get install mongodb-clients
187 59 mongo 'mongodb://mongodb.cms.waikato.ac.nz:27017' -u anupama -p
188 60 sudo apt-get install -y mongodb-org
189 61 mongo 'mongodb://mongodb.cms.waikato.ac.nz:27017' -u anupama -p
190 62 sudo service apache2 status
191 63 sudo service sshd status
192 64 sudo service mongodb status
193 65 sudo service mongo status
194 66 mongod
195 67 mongod --help
196 68 mongod --help | less
197 69 mongod -f /etc/mongod.conf
198 70 sudo mongod -f /etc/mongod.conf
199 71 less /etc/mongod.conf
200 72 sudo service mongod status
201 73 sudo service mongod start
202 74 sudo service mongod status
203 75 ls -l /var/log/mongodb/mongod.log
204 76 sudo rm /var/log/mongodb/mongod.log
205 77 sudo service mongod status
206 78 sudo service mongod start
207 79 sudo service mongod status
208 80 sudo service mongod stop
209 81 ps auxww | grep mongo
210 82 sudo service mongod start
211 83 sudo service mongod status
212 84 ps auxww | grep mongo
213 85 sudo dmsg
214 86 sudo dmesg
215 87 sudo service mongod status
216 88 sudo service mongod stop
217 89 sudo service mongod start
218 90 sudo dmesg
219 91 sudo less /var/log/mongodb/mongod.log
220 92 ls /var/lib/
221 93 ls -ld /var/lib/
222 94 ls -l /var/log/mongodb/mongod.log
223 95 ls -ld /var/lib/
224 96 groups mongodb
225 97 less /etc/mongod.conf
226 98 sudo less /var/log/mongodb/mongod.log
227 99 less /etc/mongod.conf
228 100 ls -l /var/lib/mongodb/
229 101 sudo chown -R mongodb /var/lib/mongodb/
230 102 sudo chgrp -R mongodb /var/lib/mongodb/
231 103 sudo service mongod start
232 104 sudo service mongod status
233 105 history
234
235
236
237MONGO DB ROBO 3T
2381. Download "Double Pack" from https://robomongo.org/
2392. Untar its contents. Then untar the tarball in that.
2403. Run:
241 wharariki:[110]~/Downloads/robo3t-1.3.1-linux-x86_64-7419c406>./bin/robo3t
242
243===================
244On analytics, vagrant node1, we've installed the mongodb server and client.
245We're able to successfully create collections on here.
246
247
248vagrant@node1:~$ mongo
249MongoDB shell version v4.2.1
250connecting to: mongodb://127.0.0.1:27017/?compressors=disabled&gssapiServiceName=mongodb
251Implicit session: session { "id" : UUID("87bb585c-4685-47f6-bf89-a93801daeb2d") }
252MongoDB server version: 4.2.1
253Server has startup warnings:
2542019-11-04T07:48:14.197+0000 I STORAGE [initandlisten]
2552019-11-04T07:48:14.198+0000 I STORAGE [initandlisten] ** WARNING: Using the XFS filesystem is strongly recommended with the WiredTiger storage engine
2562019-11-04T07:48:14.198+0000 I STORAGE [initandlisten] ** See http://dochub.mongodb.org/core/prodnotes-filesystem
2572019-11-04T07:48:14.624+0000 I CONTROL [initandlisten]
2582019-11-04T07:48:14.624+0000 I CONTROL [initandlisten] ** WARNING: Access control is not enabled for the database.
2592019-11-04T07:48:14.624+0000 I CONTROL [initandlisten] ** Read and write access to data and configuration is unrestricted.
2602019-11-04T07:48:14.624+0000 I CONTROL [initandlisten]
261---
262Enable MongoDB's free cloud-based monitoring service, which will then receive and display
263metrics about your deployment (disk utilization, CPU, operation statistics, etc).
264
265The monitoring data will be available on a MongoDB website with a unique URL accessible to you
266and anyone you share the URL with. MongoDB may use this information to make product
267improvements and to suggest MongoDB products and deployment options to you.
268
269To enable free monitoring, run the following command: db.enableFreeMonitoring()
270To permanently disable this reminder, run the following command: db.disableFreeMonitoring()
271---
272
273> show dbs
274admin 0.000GB
275config 0.000GB
276local 0.000GB
277> use db ateacrawldata
2782019-11-05T05:24:20.155+0000 E QUERY [js] Error: [db ateacrawldata] is not a valid database name :
279Mongo.prototype.getDB@src/mongo/shell/mongo.js:51:12
280getDatabase@src/mongo/shell/session.js:913:28
281DB.prototype.getSiblingDB@src/mongo/shell/db.js:22:12
282shellHelper.use@src/mongo/shell/utils.js:803:10
283shellHelper@src/mongo/shell/utils.js:790:15
284@(shellhelp2):1:1
285> db.createCollection('webpages');
286{ "ok" : 1 }
287> db.webpages.drop();
288... ^C
289
290> db.webpages.drop();
291true
292> use ateacrawldata
293switched to db ateacrawldata
294> db.createCollection('webpages');
295{ "ok" : 1 }
296> show collections
297webpages
298> db.createCollection('websites');
299{ "ok" : 1 }
300>
301
302------------------------
303
304Ask Clint to rename "anupama" database to "ateacrawldata" database following the instructions at:
305 https://stackoverflow.com/questions/9201832/how-do-you-rename-a-mongodb-database
306I don't have permissions to do this.
307Nor do I have permissions to create Mongo collections within a new database that I create, like ateacrawldata.
308I only seem to have rights to the "anupama" database.
309
310
311
312-----------------------
313Vagrant virtual machine Node1 has the mongodb installed.
314
315After doing "vagrant up" on node1 to start node1:
316
317 [anupama@analytics vagrant-hadoop-hive-spark]$ vagrant ssh
318 vagrant@node1:~$ mongo
319 MongoDB shell version v4.2.1
320 connecting to: mongodb://127.0.0.1:27017/?compressors=disabled&gssapiServiceName=mongodb
321 2019-11-13T09:22:46.996+0000 E QUERY [js] Error: couldn't connect to server 127.0.0.1:27017, connection attempt failed: SocketException: Error connecting to 127.0.0.1:27017 :: caused by :: Connection refused :
322 connect@src/mongo/shell/mongo.js:341:17
323 @(connect):2:6
324 2019-11-13T09:22:46.999+0000 F - [main] exception: connect failed
325 2019-11-13T09:22:46.999+0000 E - [main] exiting with code 1
326 vagrant@node1:~$ sudo service mongod status
327 ● mongod.service - MongoDB Database Server
328 Loaded: loaded (/lib/systemd/system/mongod.service; disabled; vendor preset: enabled)
329 Active: inactive (dead)
330 Docs: https://docs.mongodb.org/manual
331 vagrant@node1:~$ sudo service mongod start
332 vagrant@node1:~$ sudo service mongod status
333 ● mongod.service - MongoDB Database Server
334 Loaded: loaded (/lib/systemd/system/mongod.service; disabled; vendor preset: enabled)
335 Active: active (running) since Wed 2019-11-13 09:24:07 UTC; 2s ago
336 Docs: https://docs.mongodb.org/manual
337 Main PID: 4383 (mongod)
338 Tasks: 32
339 Memory: 199.3M
340 CPU: 754ms
341 CGroup: /system.slice/mongod.service
342 └─4383 /usr/bin/mongod --config /etc/mongod.conf
343
344 Nov 13 09:24:07 node1 systemd[1]: Started MongoDB Database Server.
345 vagrant@node1:~$
346
347
348So now mongodb is running on node1 on localhost:27017.
349
350Next, in another x-term connected to analytics' node1 Vagrant VM, port forward node1's localhost:27017 to analytics' localhost:27017:
351 vagrant ssh -- -L 27017:localhost:27017
352
353
354
355Finally, in another x-term, port-forward from analytics:27017 to current machine's 27017:
356 ssh -L 27017:localhost:27017 analytics
357
358
359Now can connect Robo-3T running on current machine to localhost:27017.
360
361Then in a new x-term, can use the client mongo shell to connect (by default to localhost:27017):
362
363 wharariki:[122]/Scratch/ak19/GS309>mongo --shell
364 MongoDB shell version v4.0.13
365 connecting to: mongodb://127.0.0.1:27017/?gssapiServiceName=mongodb
366 ...
367 > show dbs
368 admin 0.000GB
369 ateacrawldata 1.532GB
370 config 0.000GB
371 local 0.000GB
372 > use ateacrawldata
373
374 > show collections
375 Webpages
376 Websites
377 oldwebpages
378 oldwebsites
379-------------------
380
381Country code to geolocation CSV file found by Dr Bainbridge:
382https://developers.google.com/public-data/docs/canonical/countries_csv
383
384Import into mongodb with:
385https://stackoverflow.com/questions/4686500/how-to-use-mongoimport-to-import-csv
386
387
388
389NOTE: mongoimport is a commandline utility and not a command to be run from the mongo shell. See https://jira.mongodb.org/browse/DOCS-11072
390This means, in an x-term, DON'T RUN MONGO SHELL/client first. Instead, directly from x-term, run the following to import the countrycodes.csv file:
391
392
393 mongoimport -d ateacrawldata -c countrylocations --type csv --file /Scratch/ak19/maori-lang-detection/MoreReading/countrycodes.csv --headerline
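
As a rough illustration of what --headerline does (the first CSV line supplies the field names of every imported document), here is a minimal sketch in plain JavaScript, runnable with Node and needing no MongoDB. The sample rows and the column names country, latitude, longitude and name are assumptions for illustration, not copied from countrycodes.csv. The function name csvToDocuments is hypothetical, and the naive comma split ignores quoted fields.

```javascript
// Sketch only: mimics how the first CSV line becomes the field names.
// The sample data is hypothetical; a real CSV file may need a proper CSV parser.
function csvToDocuments(csvText) {
  const lines = csvText.trim().split("\n");
  const fieldNames = lines[0].split(",");   // header line -> field names
  return lines.slice(1).map((line) => {
    const values = line.split(",");
    const doc = {};
    fieldNames.forEach((name, i) => { doc[name] = values[i]; });
    return doc;                              // values are kept as strings here
  });
}

const sampleCsv =
  "country,latitude,longitude,name\n" +
  "NZ,-40.9,174.9,New Zealand\n" +
  "AU,-25.3,133.8,Australia";
const docs = csvToDocuments(sampleCsv);
console.log(docs);
```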
394
395
396-------------------------
397
398MONGODB QUERIES:
399
db.getCollection('webpages').find({"isMRI": true, "singleSentences.langCode": "mri"})
db.getCollection('webpages').find({"singleSentences": { $elemMatch: {"langCode":"mri"} } }, {"singleSentences.$": 1})
db.getCollection('Webpages').find({"isMRI": true, "singleSentences": { $elemMatch: {"langCode":"eng"} } }, {"singleSentences.$": 1}) [gets the 1st English sentence of each page that is overall in MRI]
db.getCollection('Webpages').find({"containsMRI": true, "singleSentences": { $elemMatch: {"langCode":"mri"} } }, {"singleSentences.$": 1}) [gets the 1st MRI sentence of each page that has sentences in MRI]
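
To make the effect of $elemMatch combined with the positional $ projection concrete, here is a minimal in-memory sketch in plain JavaScript (runnable with Node). It is not how MongoDB implements the query. The sample pages, the "sentence" field and the function name firstSentenceInLanguage are assumptions for illustration; only singleSentences, langCode and isMRI come from these notes.

```javascript
// Sketch only: an in-memory imitation of
//   find({singleSentences: {$elemMatch: {langCode: code}}}, {"singleSentences.$": 1})
function firstSentenceInLanguage(pages, code) {
  return pages
    // $elemMatch: keep pages with at least one sentence in the given language
    .filter((p) => p.singleSentences.some((s) => s.langCode === code))
    .map((p) => ({
      _id: p._id,
      // positional $ projection: keep only the first matching array element
      singleSentences: [p.singleSentences.find((s) => s.langCode === code)],
    }));
}

// Hypothetical sample data.
const samplePages = [
  { _id: 1, isMRI: true, singleSentences: [
    { langCode: "eng", sentence: "Hello." },
    { langCode: "mri", sentence: "Kia ora." },
    { langCode: "mri", sentence: "Haere mai." },
  ] },
  { _id: 2, isMRI: false, singleSentences: [
    { langCode: "eng", sentence: "No Maori here." },
  ] },
];
const firstMri = firstSentenceInLanguage(samplePages, "mri");
console.log(JSON.stringify(firstMri));
```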
404
405
406READING
407
408mongodb java convert class
409https://www.quora.com/What-are-the-ways-of-converting-a-Java-object-to-a-MongoDB-document-and-vice-versa
410https://stackoverflow.com/questions/39320825/pojo-to-org-bson-document-and-vice-versa
411X https://mongodb.github.io/morphia/
412https://stackoverflow.com/questions/10170506/inserting-java-object-to-mongodb-collection-using-java
413X https://www.google.com/search?q=morphia+example&oq=morphia+example&aqs=chrome.0.0l6.4223j0j9&sourceid=chrome&ie=UTF-8
414https://www.baeldung.com/mongodb-morphia
415X https://web.archive.org/web/20171117121335/http://mongodb.github.io/morphia/1.3/getting-started/
416=> https://morphia.dev/1.4/getting-started/quick-tour/
417https://github.com/MorphiaOrg/morphia/tree/master/docs/reference
418
419
420mongodb querying
421https://docs.mongodb.com/manual/tutorial/query-embedded-documents/
422https://docs.mongodb.com/manual/tutorial/query-arrays/
423https://www.google.com/search?q=mongodb+find+subdocument&oq=mongodb+find+&aqs=chrome.0.69i59j69i57j0l4.7607j1j8&sourceid=chrome&ie=UTF-8
424https://stackoverflow.com/questions/25586901/how-to-find-document-and-single-subdocument-matching-given-criterias-in-mongodb
425https://stackoverflow.com/questions/21113543/mongodb-get-subdocument
426https://stackoverflow.com/questions/36948856/find-subdocuments-in-mongo
427https://docs.mongodb.com/v3.0/reference/operator/projection/positional/#proj._S_
428https://www.google.com/search?q=mongodb+query+tutorial&oq=mongodb+query+tutorial&aqs=chrome..69i57j0l2j69i60l3.4719j0j7&sourceid=chrome&ie=UTF-8
429https://blog.exploratory.io/an-introduction-to-mongodb-query-for-beginners-bd463319aa4c
430https://docs.mongodb.com/manual/reference/method/db.collection.find/
431https://docs.mongodb.com/manual/reference/method/db.collection.find/#find-projection
432https://stackoverflow.com/questions/39641925/mongodb-aggregation-framework-to-get-frequencies-of-fields-values
433
434https://exploratory.io/note/kanaugust/0961813761939766
435https://docs.mongodb.com/manual/tutorial/project-fields-from-query-results/
436https://docs.mongodb.com/manual/aggregation/
437
438
439Mongo Studio 3T documentation:
440https://studio3t.com/download/ (also has uninstall information)
441https://studio3t.com/download-thank-you/?OS=x64
442
443Google: MongoDB visualization
444MongoDB visualization map
445MongoDB Charts
446 (Open source visualisation tools)
447
448json map visualizer
449 geojson.tools
450-------------------
451
452Some queries with results:
453
454# Num websites
455db.getCollection('Websites').find({}).count()
4561445
457
458# Num webpages
459db.getCollection('Webpages').find({}).count()
460X75139
461117496
462
# Find number of websites that have 1 or more pages in Maori (a positive numPagesInMRI)
464db.getCollection('Websites').find({numPagesInMRI: { $gt: 0}}).count()
465361
466
467# Number of sites containing at least one sentence for which OpenNLP detected the best language = MRI
468db.getCollection('Websites').find({numPagesContainingMRI: {$gt: 0}}).count()
469868
470
# The union of the above two is identical to the numPagesContainingMRI > 0 result, as expected if every site with pages in MRI also has pages containing MRI sentences:
472db.getCollection('Websites').find({ $or: [ { numPagesInMRI: { $gt: 0 } }, { numPagesContainingMRI: {$gt: 0} } ] } ).count()
473868
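
The reasoning behind the union count can be checked on a toy data set. This is a minimal sketch in plain JavaScript (runnable with Node); the three site records are invented for illustration and only the field names come from these notes.

```javascript
// Sketch only: hypothetical site records using the field names from the notes.
const sampleSites = [
  { domain: "a.example", numPagesInMRI: 3, numPagesContainingMRI: 5 },
  { domain: "b.example", numPagesInMRI: 0, numPagesContainingMRI: 2 },
  { domain: "c.example", numPagesInMRI: 0, numPagesContainingMRI: 0 },
];

const inMRI = sampleSites.filter((s) => s.numPagesInMRI > 0);
const containingMRI = sampleSites.filter((s) => s.numPagesContainingMRI > 0);
// Equivalent of the $or query: either count is positive.
const union = sampleSites.filter(
  (s) => s.numPagesInMRI > 0 || s.numPagesContainingMRI > 0
);

// If every site with pages in MRI also has pages containing MRI,
// the union adds nothing beyond the "containing" set.
const inMriIsSubset = inMRI.every((s) => s.numPagesContainingMRI > 0);
console.log(inMRI.length, containingMRI.length, union.length, inMriIsSubset);
```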
474
475# Find number of webpages that are deemed to be overall in MRI (pages where isMRI=true)
476db.getCollection('Webpages').find({isMRI:true}).count()
477X5224
478X5215
479db.getCollection('Webpages').find({isMRI:true}).count()
4807818
481
482# Number of pages that contain any number of MRI sentences
483db.getCollection('Webpages').find({containsMRI: true}).count()
484X12858
48520371
486
487
488# Number of sites with URLs containing /mi(/)
489db.getCollection('Websites').find({urlContainsLangCodeInPath:true}).count()
490153
491
# Number of websites outside NZ that contain /mi(/) in any of their sub-URLs
db.getCollection('Websites').find({urlContainsLangCodeInPath:true, geoLocationCountryCode: {$ne : "NZ"} }).count()
147

# The 6 sites with URLs containing /mi(/) that are in NZ
db.getCollection('Websites').find({urlContainsLangCodeInPath:true, geoLocationCountryCode: "NZ"}).count()
6
499
500
501# sort websites that contain /mi(/) in path by geoLocationCountryCode
502# https://www.quackit.com/mongodb/tutorial/mongodb_sort_query_results.cfm
503db.getCollection('Websites').find({urlContainsLangCodeInPath:true}).sort({geoLocationCountryCode: 1})
504
505Actually, I want to sort by count. See https://docs.mongodb.com/manual/reference/operator/aggregation/sortByCount/
506
507
508# PROJECTION:
db.getCollection('Websites').find({geoLocationCountryCode: {$ne:"NZ"}}, {geoLocationCountryCode:1, urlContainsLangCodeInPath: 1})
510
511https://docs.mongodb.com/manual/aggregation/
512EXAMPLE:
513db.orders.aggregate([
514 { $match: { status: "A" } },
515 { $group: { _id: "$cust_id", total: { $sum: "$amount" } } }
516])
517
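X db.Websites.aggregate([{ $match:{urlContainsLangCodeInPath:true}}, $group: {geoLocationCountryCode:1, total: $count}])
  (syntax error: the $group stage is not wrapped in its own { }, and a $group needs an _id field)


X db.Websites.aggregate([
 { $match:{urlContainsLangCodeInPath:true}},
 {$group: {geoLocationCountryCode:1}}
])
  (fails: a $group stage must specify an _id to group by)

WORKS (adding an $unwind stage on geoLocationCountryCode before the $group would drop sites where that field is null or missing):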
527db.Websites.aggregate([
528 { $match:{urlContainsLangCodeInPath:true}},
529 {$group: {_id: "$geoLocationCountryCode", count: {$sum: 1}}},
530 { $sort : { count : -1} }
531])
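
For comparison, the same $match, $group and $sort logic can be written out by hand. This is a minimal sketch in plain JavaScript (runnable with Node) over invented sample sites. The function name countByCountry and the sample records are assumptions; the field names are those used in the query above.

```javascript
// Sketch only: plain-JavaScript equivalent of $match -> $group (count) -> $sort.
function countByCountry(sites) {
  const counts = new Map();
  for (const site of sites) {
    if (site.urlContainsLangCodeInPath !== true) continue;   // $match
    // $group _id: documents lacking the field are grouped under null
    const key = site.geoLocationCountryCode === undefined
      ? null
      : site.geoLocationCountryCode;
    counts.set(key, (counts.get(key) || 0) + 1);              // count: {$sum: 1}
  }
  return [...counts.entries()]
    .map(([_id, count]) => ({ _id, count }))
    .sort((a, b) => b.count - a.count);                       // $sort: {count: -1}
}

// Hypothetical sample data.
const sampleSites = [
  { domain: "a.example", urlContainsLangCodeInPath: true, geoLocationCountryCode: "US" },
  { domain: "b.example", urlContainsLangCodeInPath: true, geoLocationCountryCode: "US" },
  { domain: "c.example", urlContainsLangCodeInPath: true, geoLocationCountryCode: "NZ" },
  { domain: "d.example", urlContainsLangCodeInPath: false, geoLocationCountryCode: "DE" },
  { domain: "e.example", urlContainsLangCodeInPath: true },
];
const countryCounts = countByCountry(sampleSites);
console.log(countryCounts);
```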
532
533
534# COUNT OF ALL GEOLOCATION COUNTRIES
535#https://stackoverflow.com/questions/14924495/mongodb-count-num-of-distinct-values-per-field-key
536 # LIST
537 db.Websites.distinct('geoLocationCountryCode');
538
539 # COUNT
540 db.Websites.distinct('geoLocationCountryCode').length;
541
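 # DISTINCT VALUES WITH QUERY, via runCommand (the result's "values" array holds the distinct values, so its length gives the count) - https://docs.mongodb.com/manual/reference/command/distinct/#dbcmd.distinct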
543
544 db.runCommand ( { distinct: "Websites", key: "geoLocationCountryCode", query: { "urlContainsLangCodeInPath": true} } );
545
546 # DISTINCT WITH QUERY WITHOUT COUNT - https://docs.mongodb.com/manual/reference/method/db.collection.distinct/
547 db.Websites.distinct('geoLocationCountryCode', {"urlContainsLangCodeInPath": true});
548
549 #SORTED - https://stackoverflow.com/questions/4759437/get-distinct-values-with-sorted-data
550 db.Websites.distinct('geoLocationCountryCode', {"urlContainsLangCodeInPath": true}).sort();
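
What distinct() with a query followed by .sort() and .length amounts to can be sketched in plain JavaScript (runnable with Node). The helper name distinctValues and the sample records are assumptions for illustration; the query is modelled here as a predicate function rather than a MongoDB query document.

```javascript
// Sketch only: in-memory imitation of db.collection.distinct(key, query).sort()
function distinctValues(docs, key, matches = () => true) {
  const seen = new Set();
  for (const doc of docs) {
    if (matches(doc) && doc[key] !== undefined) seen.add(doc[key]);
  }
  return [...seen].sort();
}

// Hypothetical sample data.
const sampleSites = [
  { geoLocationCountryCode: "US", urlContainsLangCodeInPath: true },
  { geoLocationCountryCode: "NZ", urlContainsLangCodeInPath: true },
  { geoLocationCountryCode: "US", urlContainsLangCodeInPath: false },
  { geoLocationCountryCode: "DE", urlContainsLangCodeInPath: false },
];

const allCodes = distinctValues(sampleSites, "geoLocationCountryCode");
const miCodes = distinctValues(
  sampleSites,
  "geoLocationCountryCode",
  (s) => s.urlContainsLangCodeInPath === true
);
console.log(allCodes, allCodes.length);   // list and count for all sites
console.log(miCodes, miCodes.length);     // list and count for /mi sites only
```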
551
552
553 # count of all sites for which the geolocation is UNKNOWN
554 db.getCollection('Websites').find({geoLocationCountryCode: {$eq:"UNKNOWN"}}).count()
555
556
557# AGGREGATION QUERIES THAT WORK:
558#https://stackoverflow.com/questions/14924495/mongodb-count-num-of-distinct-values-per-field-key
559
560WORKS:
561// count of country codes for all sites
562db.Websites.aggregate([
563
564 { $unwind: "$geoLocationCountryCode" },
565 {
566 $group: {
567 _id: "$geoLocationCountryCode",
568 count: { $sum: 1 }
569 }
570 },
571 { $sort : { count : -1} }
572]);
573
574// count of country codes for sites that have at least one page detected as MRI
575
576db.Websites.aggregate([
577 {
578 $match: {
579 numPagesInMRI: {$gt: 0}
580 }
581 },
582 { $unwind: "$geoLocationCountryCode" },
583 {
584 $group: {
585 _id: {$toLower: '$geoLocationCountryCode'},
586 count: { $sum: 1 }
587 }
588 },
589 { $sort : { count : -1} }
590]);
591
592// count of country codes for sites that have at least one page containing at least one sentence detected as MRI
593db.Websites.aggregate([
594 {
595 $match: {
596 numPagesContainingMRI: {$gt: 0}
597 }
598 },
599 { $unwind: "$geoLocationCountryCode" },
600 {
601 $group: {
602 _id: {$toLower: '$geoLocationCountryCode'},
603 count: { $sum: 1 }
604 }
605 },
606 { $sort : { count : -1} }
607]);
608
609
610WORKS:
611// count of country codes for sites that have /mi(/) in path
612
613db.Websites.aggregate([
614 {
615 $match: {
616 urlContainsLangCodeInPath: true
617 }
618 },
619 { $unwind: "$geoLocationCountryCode" },
620 {
621 $group: {
622 _id: {$toLower: '$geoLocationCountryCode'},
623 count: { $sum: 1 }
624 }
625 },
626 { $sort : { count : -1} }
627]);
628
629
WORKS:
// count of country codes for all sites whose geolocation is known
631db.Websites.aggregate([
632 {
633 $match: {
634 geoLocationCountryCode: {$ne : "UNKNOWN"}
635 }
636 },
637 { $unwind: "$geoLocationCountryCode" },
638 {
639 $group: {
640 _id: "$geoLocationCountryCode",
641 count: { $sum: 1 }
642 }
643 },
644 { $sort : { count : -1} }
645]);
646
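WORKS:
// count of country codes for sites that have /mi(/) in path (as before, but without lowercasing the country code)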
648db.Websites.aggregate([
649 {
650 $match: {
651 "urlContainsLangCodeInPath": true
652 }
653 },
654 { $unwind: "$geoLocationCountryCode" },
655 {
656 $group: {
657 _id: "$geoLocationCountryCode",
658 count: { $sum: 1 }
659 }
660 },
661 { $sort : { count : -1} }
662]);
663
664
665KEEP ADDITIONAL FIELDS - https://stackoverflow.com/questions/16662405/mongo-group-query-how-to-keep-fields:
666 a. KEEPS ONLY FIRST DOMAIN URL FOR EACH COUNTED COUNTRY CODE:
667
668 db.Websites.aggregate([
669 {
670 $match: {
671 "urlContainsLangCodeInPath": true
672 }
673 },
674 { $unwind: "$geoLocationCountryCode" },
675 {
676 $group: {
677 _id: "$geoLocationCountryCode", count: { $sum: 1 },
678 domain: {$first: '$domain'}
679 }
680 },
681 { $sort : { count : -1} }
682 ]);
683
684 b. KEEP ALL DOMAIN URLS:
685 db.Websites.aggregate([
686 {
687 $match: {
688 "urlContainsLangCodeInPath": true
689 }
690 },
691 { $unwind: "$geoLocationCountryCode" },
692 {
693 $group: {
694 _id: "$geoLocationCountryCode",
695 count: { $sum: 1 },
696 domain: { $addToSet: '$domain' }
697 }
698 },
699 { $sort : { count : -1} }
700 ]);
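
The difference between keeping the first domain ($first) and keeping all domains ($addToSet) can be seen in a small in-memory sketch in plain JavaScript (runnable with Node). The function name groupDomainsByCountry and the sample sites are assumptions for illustration. As with $first in MongoDB, which domain counts as "first" depends on the order in which documents arrive.

```javascript
// Sketch only: per country code, keep a count, the first domain seen
// (like $first) and the set of all distinct domains (like $addToSet).
function groupDomainsByCountry(sites) {
  const groups = new Map();
  for (const site of sites) {
    const code = site.geoLocationCountryCode;
    if (!groups.has(code)) {
      groups.set(code, {
        _id: code,
        count: 0,
        firstDomain: site.domain,   // fixed when the group is first created
        domains: new Set(),
      });
    }
    const group = groups.get(code);
    group.count += 1;
    group.domains.add(site.domain); // a Set ignores repeated domains
  }
  return [...groups.values()]
    .map((g) => ({ ...g, domains: [...g.domains] }))
    .sort((a, b) => b.count - a.count);
}

// Hypothetical sample data.
const sampleSites = [
  { domain: "http://a.example", geoLocationCountryCode: "US" },
  { domain: "http://b.example", geoLocationCountryCode: "US" },
  { domain: "http://c.example", geoLocationCountryCode: "NZ" },
  { domain: "http://a.example", geoLocationCountryCode: "US" },
];
const grouped = groupDomainsByCountry(sampleSites);
console.log(JSON.stringify(grouped, null, 2));
```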
701
702
703# WANT TO GET THE ABOVE INTO WORLD MAP, use geojson.tools found by Dr Bainbridge
704geojson.tools
705USAGE: https://www.here.xyz/viewer-tool/
706
707
708AIMS:
709* Identify where Maori language is online.
710* How can we identify high quality sites that would be good for a corpus.
711(Related work for other languages to quantifiably answer that)
712
713data-preparation
714docs
715
716
717------------------------------------------
718
719BUILDING TOWARDS NEW MONGODB QUERY: Counts by country code of TENTATIVE NON-PRODUCT SITES that are in Maori
720---
721
722# https://stackoverflow.com/questions/16902930/mongodb-aggregation-framework-match-or
723# https://docs.mongodb.com/manual/reference/operator/query/and/
724
725# 1. all the websites which are from NZ:
726db.getCollection('Websites').find({geoLocationCountryCode: "NZ"}).count()
727128
728
729# 2. all the websites that have /mi in URL path which are from NZ:
db.getCollection('Websites').find({$and: [{urlContainsLangCodeInPath: true}, {geoLocationCountryCode: "NZ"}]}).count()
7316
732
733# 3. all the websites that don't have /mi in URLpath
734db.getCollection('Websites').find({urlContainsLangCodeInPath: false}).count()
7351292
736
# 4. all the websites that don't have /mi or, if they do, are from NZ
# (should be the sum of points 2 and 3 above)
739db.getCollection('Websites').find({$or: [{urlContainsLangCodeInPath: false}, {$and: [{urlContainsLangCodeInPath: true}, {geoLocationCountryCode: "NZ"}]}]}).count()
7401298
741
# 5. All the websites that have at least 1 page containing MRI AND either don't have /mi in the URL path or, if they do, are from NZ
# These are the TENTATIVE NON-PRODUCT SITES
# Should be less than point 4 (1298), and at most 868 (all the sites with numPagesContainingMRI > 0)
745db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{$or: [{urlContainsLangCodeInPath: false}, {$and: [{urlContainsLangCodeInPath: true}, {geoLocationCountryCode: "NZ"}]}]}]}).count()
746859
747
748# 6. Now do the counts by country code of the above, by pasting the query of point 5 as the $match clause (i.e. without the .count() suffix)
749# Counts by country code of TENTATIVE NON-PRODUCT SITES that are in Maori
750db.Websites.aggregate([
751 {
752 $match: {$and: [{numPagesContainingMRI: {$gt: 0}},{$or: [{urlContainsLangCodeInPath: false}, {$and: [{urlContainsLangCodeInPath: true}, {geoLocationCountryCode: "NZ"}]}]}]}
753 },
754 { $unwind: "$geoLocationCountryCode" },
755 {
756 $group: {
757 _id: {$toLower: '$geoLocationCountryCode'},
758 count: { $sum: 1 }
759 }
760 },
761 { $sort : { count : -1} }
762]);
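
Since the same pipeline may need to be rerun with numPagesInMRI in place of numPagesContainingMRI, one option is to build it from a small helper. This is a sketch only: the function name tentativeNonProductPipeline is hypothetical, and the stages are copied from the query above with the count field made a parameter. It runs under Node as plain object construction and has not been tested against the actual database.

```javascript
// Sketch only: builds the step-6 pipeline for a chosen page-count field,
// e.g. "numPagesContainingMRI" or "numPagesInMRI".
function tentativeNonProductPipeline(countField) {
  return [
    {
      $match: {
        $and: [
          { [countField]: { $gt: 0 } },
          {
            $or: [
              { urlContainsLangCodeInPath: false },
              {
                $and: [
                  { urlContainsLangCodeInPath: true },
                  { geoLocationCountryCode: "NZ" },
                ],
              },
            ],
          },
        ],
      },
    },
    { $unwind: "$geoLocationCountryCode" },
    { $group: { _id: { $toLower: "$geoLocationCountryCode" }, count: { $sum: 1 } } },
    { $sort: { count: -1 } },
  ];
}

// With the helper pasted into the mongo shell one could then try, for example:
//   db.Websites.aggregate(tentativeNonProductPipeline("numPagesInMRI"))
const pipeline = tentativeNonProductPipeline("numPagesInMRI");
console.log(JSON.stringify(pipeline, null, 2));
```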
763
The result is very close to the same aggregate on just numPagesContainingMRI > 0.

That's because very few websites both contain /mi/ in their URL path AND have numPagesContainingMRI > 0:
767
768db.Websites.aggregate([
769 {
770 $match: {
771 $and: [{numPagesContainingMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}]
772 }
773 },
774 { $unwind: "$geoLocationCountryCode" },
775 {
776 $group: {
777 _id: {$toLower: '$geoLocationCountryCode'},
778 count: { $sum: 1 }
779 }
780 },
781 { $sort : { count : -1} }
782]);
783
784
785_id count
786us 4.0
787nz 4.0
788au 3.0
789ru 1.0
790de 1.0
791
792Total: 13 sites that have /mi/ and are detected as having MRI content,
793db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}]}).count()
79413
795
Of these 13, the 4 from NZ were already included in steps 5 and 6. So the difference is only the 9 non-NZ sites (13 - 4) that have /mi/ in the URL path.
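
As a cross-check of the arithmetic, the per-country counts can be summed and the NZ sites subtracted. A minimal sketch, with the counts copied by hand from the result listing above (variable names are illustrative):

```javascript
// Counts copied from the aggregate result above
// (sites with /mi/ in the path that also have MRI content).
const miSiteCounts = { us: 4, nz: 4, au: 3, ru: 1, de: 1 };

const totalMiSites = Object.values(miSiteCounts).reduce((sum, n) => sum + n, 0);
const nonNzMiSites = totalMiSites - miSiteCounts.nz;

console.log(totalMiSites, nonNzMiSites); // prints: 13 9
```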


Let's get a listing of the sites' domains - 3 whose country codes are NOT NZ have NZ TLD!
/* 1 */
{
    "_id" : "nz",
    "count" : 4.0,
    "domain" : [
        "http://firstworldwar.tki.org.nz",
        "http://www.firstworldwar.tki.org.nz",
        "https://admin.teara.govt.nz",
        "http://community.nzdl.org"
    ]
}

/* 2 */
{
    "_id" : "us",
    "count" : 4.0,
    "domain" : [
        "https://sexualviolence.victimsinfo.govt.nz",
        "https://follow3rs.com",
        "http://www.church-of-christ.org",
        "http://www.mytrickstips.com"
    ]
}

/* 3 */
{
    "_id" : "au",
    "count" : 3.0,
    "domain" : [
        "https://rapuatearatika.education.govt.nz",
        "https://www.kiwiproperty.com",
        "https://curriculumtool.education.govt.nz"
    ]
}

/* 4 */
{
    "_id" : "ru",
    "count" : 1.0,
    "domain" : [
        "http://www.treningmozga.com"
    ]
}

/* 5 */
{
    "_id" : "de",
    "count" : 1.0,
    "domain" : [
        "http://www.almancax.com" # Website to learn German, autotranslated
    ]
}


But we're not catching a potentially large number of auto-translated sites, like
- https://www.gigalight.com/all-languages.html
- http://www.hzhinew.com/


--------------
GETTING TABLE DATA OUT OF MONGO DB:

https://stackoverflow.com/questions/28733692/how-to-export-json-from-mongodb-using-robomongo
"export to file" as in a spreadsheet like to a .csv?

IMO this is the EASIEST way to do this in Robo 3T (formerly robomongo):

    1. In the top right of the Robo 3T GUI there is a "View Results in text mode" button, click it and copy everything

    2. paste everything into this website: https://json-csv.com/

    3. click the download button and now you have it in a spreadsheet.
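
If pasting results into a third-party website is not desirable, flat result documents like the ones in these notes can also be turned into CSV locally. This is only a sketch under assumptions: the function name toCsv is hypothetical, every document is assumed to have the same keys as the first one, and nested values (such as the "domain" arrays) are not handled.

```javascript
// Sketch only: turns an array of flat result documents into CSV text.
// Assumes every document has the same keys as the first one, with no nested values.
function toCsv(rows) {
    if (rows.length === 0) return "";
    const columns = Object.keys(rows[0]);
    const quote = (value) => {
        const text = String(value);
        // Quote fields containing commas, quotes or newlines; double any embedded quotes.
        return /[",\n]/.test(text) ? '"' + text.replace(/"/g, '""') + '"' : text;
    };
    const lines = [columns.join(",")];
    for (const row of rows) {
        lines.push(columns.map((col) => quote(row[col])).join(","));
    }
    return lines.join("\n");
}

// Rows copied from the /mi/ aggregate result earlier in these notes.
const csvText = toCsv([
    { _id: "us", count: 4 },
    { _id: "nz", count: 4 },
    { _id: "au", count: 3 }
]);
console.log(csvText);
```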


https://json-csv.com/


---------------------

#    _id   count   lon      lat      note
1    US    93.0    -95.8    40.33
2    AU    7.0     135.8    -25.33
3    CN    7.0     100.8    32.33
4    NZ    5.0     175.8    -40.33
5    DE    5.0     10.8     50.33
6    HK    5.0     114      22.33
7    RU    4.0     38.4     55.5
8    JP    3.0     137.8    36
9    GB    3.0     -2       53.33
10   CA    2.0     -105.8   55.33
11   FR    2.0     3        47.33
12   DK    2.0     9.5      55.33
13   VG    2.0     -64.8    18.35    British Virgin Islands
14   UA    1.0     31.5     48.5     Ukraine
15   CZ    1.0     16.2     49.7
16   CH    1.0     8.5      47       Switzerland
17   ZA    1.0     24.2     -30.7    South Africa
18   NL    1.0     5.8      52.33
19   KR    1.0     127.8    36.8


/** http://geojson.tools/


{
  "type": "MultiPoint",
  "coordinates": [
    [-95.8, 40.33],
    [135.8, -25.33],
    [100.8, 32.33],
    [175.8, -40.33],
    [10.8, 50.33],
    [10.8, 50.33],
    [114, 22.33],
    [38.4, 55.5],
    [-2, 53.33],
    [137.8, 36],
    [-105.8, 55.33],
    [3, 47.33],
    [9.5, 55.33],
    [-64.8, 18.35],
    [31.5, 48.5],
    [16.2, 49.7],
    [8.5, 47],
    [24.2, -30.7],
    [5.8, 52.33],
    [127.8, 36.8]
  ]
}

*/
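
Rather than typing the MultiPoint by hand, it could be generated from the per-country records. A minimal sketch in plain JavaScript, assuming the records are kept as objects with lon/lat fields (those field names and the function name toMultiPoint are illustrative, not from the database). Note that GeoJSON positions are [longitude, latitude], in that order.

```javascript
// Sketch only: builds a GeoJSON MultiPoint from hand-entered country records.
// GeoJSON positions are [longitude, latitude], in that order.
function toMultiPoint(records) {
    return {
        type: "MultiPoint",
        coordinates: records.map((r) => [r.lon, r.lat])
    };
}

// First four records from the country listing above.
const countryRecords = [
    { _id: "US", count: 93, lon: -95.8, lat: 40.33 },
    { _id: "AU", count: 7,  lon: 135.8, lat: -25.33 },
    { _id: "CN", count: 7,  lon: 100.8, lat: 32.33 },
    { _id: "NZ", count: 5,  lon: 175.8, lat: -40.33 }
];

const multiPoint = toMultiPoint(countryRecords);
console.log(JSON.stringify(multiPoint, null, 2));
```

The printed JSON can then be pasted into http://geojson.tools/ in the same way as the hand-written version.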