1 | MongoDB
|
---|
2 | Installation:
|
---|
3 | https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/
|
---|
4 | https://docs.mongodb.com/manual/administration/install-on-linux/
|
---|
5 | https://hevodata.com/blog/install-mongodb-on-ubuntu/
|
---|
6 | https://www.digitalocean.com/community/tutorials/how-to-install-mongodb-on-ubuntu-16-04
|
---|
7 | CENTOS (Analytics): https://tecadmin.net/install-mongodb-on-centos/
|
---|
8 | FROM SOURCE: https://github.com/mongodb/mongo/wiki/Build-Mongodb-From-Source
|
---|
9 | GUI:
|
---|
10 | https://robomongo.org/
|
---|
11 | Robomongo is now called Robo 3T.
|
---|
12 |
|
---|
13 | https://www.tutorialspoint.com/mongodb/mongodb_java.htm
|
---|
14 | JAR FILE:
|
---|
15 | http://central.maven.org/maven2/org/mongodb/mongo-java-driver/
|
---|
16 | https://mongodb.github.io/mongo-java-driver/
|
---|
17 |
|
---|
18 |
|
---|
19 |
|
---|
20 | https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/
|
---|
21 | http://www.programmersought.com/article/6500308940/
|
---|
22 |
|
---|
23 | 52 sudo apt-get install mongodb-clients
|
---|
24 | 53 mongo 'mongodb://mongodb.cms.waikato.ac.nz:27017' -u anupama -p
|
---|
25 |
|
---|
26 | Failed with
|
---|
27 | Error: HostAndPort: host is empty at src/mongo/shell/mongo.js:148
|
---|
28 | exception: connect failed
|
---|
29 |
|
---|
30 | This is due to a version incompatibility between Client and mongodb Server.
|
---|
31 | The solution is to follow instructions at http://www.programmersought.com/article/6500308940/
|
---|
32 | and then https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/
|
---|
33 | as below:
|
---|
34 |
|
---|
35 | 54 sudo apt-get purge mongodb-clients
|
---|
36 | 55 sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 9DA31620334BD75D9DCB49F368818C72E52529D4
|
---|
37 | 56 echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu xenial/mongodb-org/4.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.0.list
|
---|
38 | 57 sudo apt-get update
|
---|
39 | 58 sudo apt-get install mongodb-clients
|
---|
40 | 59 mongo 'mongodb://mongodb.cms.waikato.ac.nz:27017' -u anupama -p
|
---|
41 | (still doesn't work)
|
---|
42 | 60 sudo apt-get install -y mongodb-org
|
---|
43 | The above installs an up-to-date mongo client, but it also installs the mongodb server. Possibly this is the only step needed to get both an up-to-date mongo client and the mongodb server?
|
---|
44 | 72 sudo service mongod status
|
---|
45 |
|
---|
46 | 103 sudo service mongod start
|
---|
47 | "mongod" stands for mongo daemon: it runs the MongoDB server, which listens for client connections.
|
---|
48 | 104 sudo service mongod status
|
---|
49 | 88 sudo service mongod stop
|
---|
50 |
|
---|
51 |
|
---|
52 | DETAILS:
|
---|
53 |
|
---|
54 | wharariki:[879]/Scratch/ak19/gs3-extensions/maori-lang-detection>mongo 'mongodb://mongodb.cms.waikato.ac.nz:27017' -u anupama -p
|
---|
55 |
|
---|
56 | This didn't work after entering the password. It failed with:
|
---|
57 |
|
---|
58 | MongoDB shell version: 2.6.10
|
---|
59 | Enter password:
|
---|
60 | connecting to: mongodb://mongodb.cms.waikato.ac.nz:27017
|
---|
61 | 2019-11-04T20:02:47.970+1300 Assertion: 13110:HostAndPort: host is empty
|
---|
62 | 2019-11-04T20:02:47.970+1300 0x6b75c9 0x659e9f 0x636f69 0x4fa55c 0x501249 0x4fa7f1 0x6006fd 0x5eb869 0x7f7bfbd47d76 0x1f3c10d06362
|
---|
63 | mongo(_ZN5mongo15printStackTraceERSo+0x39) [0x6b75c9]
|
---|
64 | mongo(_ZN5mongo10logContextEPKc+0x21f) [0x659e9f]
|
---|
65 | mongo(_ZN5mongo11msgassertedEiPKc+0xd9) [0x636f69]
|
---|
66 | mongo(_ZN5mongo16ConnectionString12_fillServersENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x50c) [0x4fa55c]
|
---|
67 | mongo(_ZN5mongo16ConnectionStringC1ENS0_14ConnectionTypeERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES9_+0x99) [0x501249]
|
---|
68 | mongo(_ZN5mongo16ConnectionString5parseERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERS6_+0x201) [0x4fa7f1]
|
---|
69 | mongo(_ZN5mongo17mongoConsExternalEPNS_7V8ScopeERKN2v89ArgumentsE+0x11d) [0x6006fd]
|
---|
70 | mongo(_ZN5mongo7V8Scope10v8CallbackERKN2v89ArgumentsE+0xa9) [0x5eb869]
|
---|
71 | /usr/lib/libv8.so.3.14.5(+0x99d76) [0x7f7bfbd47d76]
|
---|
72 | [0x1f3c10d06362]
|
---|
73 | 2019-11-04T20:02:47.971+1300 Error: HostAndPort: host is empty at src/mongo/shell/mongo.js:148
|
---|
74 | exception: connect failed
|
---|
75 |
|
---|
76 |
|
---|
77 | This is due to a version incompatibility between Client and mongodb Server.
|
---|
78 | The client version appears in the output above (2.6.10).
|
---|
79 | The server version can be found from within the mongo client shell. Starting the shell without connecting to a db:
|
---|
80 |
|
---|
81 |
|
---|
82 | wharariki:[880]/Scratch/ak19/gs3-extensions/maori-lang-detection>mongo --shell -nodb
|
---|
83 | MongoDB shell version: 2.6.10 <<<<<<<<<-------------------<<<< MONGO CLIENT VERSION
|
---|
84 | type "help" for help
|
---|
85 | > help
|
---|
86 | db.help() help on db methods
|
---|
87 | db.mycoll.help() help on collection methods
|
---|
88 | sh.help() sharding helpers
|
---|
89 | rs.help() replica set helpers
|
---|
90 | help admin administrative help
|
---|
91 | help connect connecting to a db help
|
---|
92 | help keys key shortcuts
|
---|
93 | help misc misc things to know
|
---|
94 | help mr mapreduce
|
---|
95 |
|
---|
96 | show dbs show database names
|
---|
97 | show collections show collections in current database
|
---|
98 | show users show users in current database
|
---|
99 | show profile show most recent system.profile entries with time >= 1ms
|
---|
100 | show logs show the accessible logger names
|
---|
101 | show log [name] prints out the last segment of log in memory, 'global' is default
|
---|
102 | use <db_name> set current database
|
---|
103 | db.foo.find() list objects in collection foo
|
---|
104 | db.foo.find( { a : 1 } ) list objects in foo where a == 1
|
---|
105 | it result of the last line evaluated; use to further iterate
|
---|
106 | DBQuery.shellBatchSize = x set default number of items to display on shell
|
---|
107 | exit quit the mongo shell
|
---|
108 |
|
---|
109 | > help connect
|
---|
110 |
|
---|
111 | Normally one specifies the server on the mongo shell command line. Run mongo --help to see those options.
|
---|
112 | Additional connections may be opened:
|
---|
113 |
|
---|
114 | var x = new Mongo('host[:port]');
|
---|
115 | var mydb = x.getDB('mydb');
|
---|
116 | or
|
---|
117 | var mydb = connect('host[:port]/mydb');
|
---|
118 |
|
---|
119 | Note: the REPL prompt only auto-reports getLastError() for the shell command line connection.
|
---|
120 |
|
---|
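121 | Trying to get help on connect options (note: mydb.connect refers to a collection named "connect", so the output below is the generic DBCollection help):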
|
---|
122 |
|
---|
123 | > var x = new Mongo('mongodb.cms.waikato.ac.nz:27017');
|
---|
124 | > var mydb = x.getDB('anupama');
|
---|
125 |
|
---|
126 | > mydb.connect.help()
|
---|
127 | DBCollection help
|
---|
128 | db.connect.find().help() - show DBCursor help
|
---|
129 | db.connect.count()
|
---|
130 | db.connect.copyTo(newColl) - duplicates collection by copying all documents to newColl; no indexes are copied.
|
---|
131 | db.connect.convertToCapped(maxBytes) - calls {convertToCapped:'connect', size:maxBytes}} command
|
---|
132 | db.connect.dataSize()
|
---|
133 | db.connect.distinct( key ) - e.g. db.connect.distinct( 'x' )
|
---|
134 | db.connect.drop() drop the collection
|
---|
135 | db.connect.dropIndex(index) - e.g. db.connect.dropIndex( "indexName" ) or db.connect.dropIndex( { "indexKey" : 1 } )
|
---|
136 | db.connect.dropIndexes()
|
---|
137 | db.connect.ensureIndex(keypattern[,options]) - options is an object with these possible fields: name, unique, dropDups
|
---|
138 | db.connect.reIndex()
|
---|
139 | db.connect.find([query],[fields]) - query is an optional query filter. fields is optional set of fields to return.
|
---|
140 | e.g. db.connect.find( {x:77} , {name:1, x:1} )
|
---|
141 | db.connect.find(...).count()
|
---|
142 | db.connect.find(...).limit(n)
|
---|
143 | db.connect.find(...).skip(n)
|
---|
144 | db.connect.find(...).sort(...)
|
---|
145 | db.connect.findOne([query])
|
---|
146 | db.connect.findAndModify( { update : ... , remove : bool [, query: {}, sort: {}, 'new': false] } )
|
---|
147 | db.connect.getDB() get DB object associated with collection
|
---|
148 | db.connect.getPlanCache() get query plan cache associated with collection
|
---|
149 | db.connect.getIndexes()
|
---|
150 | db.connect.group( { key : ..., initial: ..., reduce : ...[, cond: ...] } )
|
---|
151 | db.connect.insert(obj)
|
---|
152 | db.connect.mapReduce( mapFunction , reduceFunction , <optional params> )
|
---|
153 | db.connect.aggregate( [pipeline], <optional params> ) - performs an aggregation on a collection; returns a cursor
|
---|
154 | db.connect.remove(query)
|
---|
155 | db.connect.renameCollection( newName , <dropTarget> ) renames the collection.
|
---|
156 | db.connect.runCommand( name , <options> ) runs a db command with the given name where the first param is the collection name
|
---|
157 | db.connect.save(obj)
|
---|
158 | db.connect.stats()
|
---|
159 | db.connect.storageSize() - includes free space allocated to this collection
|
---|
160 | db.connect.totalIndexSize() - size in bytes of all the indexes
|
---|
161 | db.connect.totalSize() - storage allocated for all data and indexes
|
---|
162 | db.connect.update(query, object[, upsert_bool, multi_bool]) - instead of two flags, you can pass an object with fields: upsert, multi
|
---|
163 | db.connect.validate( <full> ) - SLOW
|
---|
164 | db.connect.getShardVersion() - only for use with sharding
|
---|
165 | db.connect.getShardDistribution() - prints statistics about data distribution in the cluster
|
---|
166 | db.connect.getSplitKeysForChunks( <maxChunkSize> ) - calculates split points over all chunks and returns splitter function
|
---|
167 | db.connect.getWriteConcern() - returns the write concern used for any operations on this collection, inherited from server/db if set
|
---|
168 | db.connect.setWriteConcern( <write concern doc> ) - sets the write concern for writes to the collection
|
---|
169 | db.connect.unsetWriteConcern( <write concern doc> ) - unsets the write concern for writes to the collection
|
---|
170 | > mydb.version()
|
---|
171 | 4.0.13 <<<<<<<<<-------------------<<<< MONGODB SERVER VERSION
|
---|
172 |
|
---|
173 | (Check Mongo server version: https://stackoverflow.com/questions/38160412/how-to-find-the-exact-version-of-installed-mongodb)
|
---|
174 |
|
---|
175 | Finally we now know the mongodb server version: 4.0.13
|
---|
176 | This version doesn't work with our mongo client (shell) version of 2.6.10.
|
---|
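A minimal sketch of the version comparison made above, in plain JavaScript. The version strings are copied from the shell output; the rule "a differing major version means trouble" is only an illustrative assumption, not MongoDB's official compatibility policy, and the function names are hypothetical.

```javascript
// Parse the first "major.minor.patch" number found in a string.
function parseVersion(text) {
  const match = /(\d+)\.(\d+)\.(\d+)/.exec(text);
  if (!match) {
    throw new Error("No version number found in: " + text);
  }
  return {
    major: Number(match[1]),
    minor: Number(match[2]),
    patch: Number(match[3]),
  };
}

// ASSUMPTION (illustrative only): treat a differing major version between
// the mongo shell and the server as a sign that the client needs updating.
function majorVersionsDiffer(clientText, serverText) {
  return parseVersion(clientText).major !== parseVersion(serverText).major;
}

// Client banner and server version as reported in the notes above.
console.log(majorVersionsDiffer("MongoDB shell version: 2.6.10", "4.0.13")); // true
```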
177 |
|
---|
178 |
|
---|
179 | DETAILS OF INSTALLING MONGO-CLIENT AND UPDATING IT, AND INSTALLING MONGODB SERVER:
|
---|
180 |
|
---|
181 |
|
---|
182 | 54 sudo apt-get purge mongodb-clients
|
---|
183 | 55 sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 9DA31620334BD75D9DCB49F368818C72E52529D4
|
---|
184 | 56 echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu xenial/mongodb-org/4.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.0.list
|
---|
185 | 57 sudo apt-get update
|
---|
186 | 58 sudo apt-get install mongodb-clients
|
---|
187 | 59 mongo 'mongodb://mongodb.cms.waikato.ac.nz:27017' -u anupama -p
|
---|
188 | 60 sudo apt-get install -y mongodb-org
|
---|
189 | 61 mongo 'mongodb://mongodb.cms.waikato.ac.nz:27017' -u anupama -p
|
---|
190 | 62 sudo service apache2 status
|
---|
191 | 63 sudo service sshd status
|
---|
192 | 64 sudo service mongodb status
|
---|
193 | 65 sudo service mongo status
|
---|
194 | 66 mongod
|
---|
195 | 67 mongod --help
|
---|
196 | 68 mongod --help | less
|
---|
197 | 69 mongod -f /etc/mongod.conf
|
---|
198 | 70 sudo mongod -f /etc/mongod.conf
|
---|
199 | 71 less /etc/mongod.conf
|
---|
200 | 72 sudo service mongod status
|
---|
201 | 73 sudo service mongod start
|
---|
202 | 74 sudo service mongod status
|
---|
203 | 75 ls -l /var/log/mongodb/mongod.log
|
---|
204 | 76 sudo rm /var/log/mongodb/mongod.log
|
---|
205 | 77 sudo service mongod status
|
---|
206 | 78 sudo service mongod start
|
---|
207 | 79 sudo service mongod status
|
---|
208 | 80 sudo service mongod stop
|
---|
209 | 81 ps auxww | grep mongo
|
---|
210 | 82 sudo service mongod start
|
---|
211 | 83 sudo service mongod status
|
---|
212 | 84 ps auxww | grep mongo
|
---|
213 | 85 sudo dmsg
|
---|
214 | 86 sudo dmesg
|
---|
215 | 87 sudo service mongod status
|
---|
216 | 88 sudo service mongod stop
|
---|
217 | 89 sudo service mongod start
|
---|
218 | 90 sudo dmesg
|
---|
219 | 91 sudo less /var/log/mongodb/mongod.log
|
---|
220 | 92 ls /var/lib/
|
---|
221 | 93 ls -ld /var/lib/
|
---|
222 | 94 ls -l /var/log/mongodb/mongod.log
|
---|
223 | 95 ls -ld /var/lib/
|
---|
224 | 96 groups mongodb
|
---|
225 | 97 less /etc/mongod.conf
|
---|
226 | 98 sudo less /var/log/mongodb/mongod.log
|
---|
227 | 99 less /etc/mongod.conf
|
---|
228 | 100 ls -l /var/lib/mongodb/
|
---|
229 | 101 sudo chown -R mongodb /var/lib/mongodb/
|
---|
230 | 102 sudo chgrp -R mongodb /var/lib/mongodb/
|
---|
231 | 103 sudo service mongod start
|
---|
232 | 104 sudo service mongod status
|
---|
233 | 105 history
|
---|
234 |
|
---|
235 |
|
---|
236 |
|
---|
237 | MONGO DB ROBO 3T
|
---|
238 | 1. Download "Double Pack" from https://robomongo.org/
|
---|
239 | 2. Untar the download, then untar the tarball contained inside it.
|
---|
240 | 3. Run:
|
---|
241 | wharariki:[110]~/Downloads/robo3t-1.3.1-linux-x86_64-7419c406>./bin/robo3t
|
---|
242 |
|
---|
243 | ===================
|
---|
244 | On the analytics machine's Vagrant VM node1, we've installed the mongodb server and client.
|
---|
245 | We're able to successfully create collections there.
|
---|
246 |
|
---|
247 |
|
---|
248 | vagrant@node1:~$ mongo
|
---|
249 | MongoDB shell version v4.2.1
|
---|
250 | connecting to: mongodb://127.0.0.1:27017/?compressors=disabled&gssapiServiceName=mongodb
|
---|
251 | Implicit session: session { "id" : UUID("87bb585c-4685-47f6-bf89-a93801daeb2d") }
|
---|
252 | MongoDB server version: 4.2.1
|
---|
253 | Server has startup warnings:
|
---|
254 | 2019-11-04T07:48:14.197+0000 I STORAGE [initandlisten]
|
---|
255 | 2019-11-04T07:48:14.198+0000 I STORAGE [initandlisten] ** WARNING: Using the XFS filesystem is strongly recommended with the WiredTiger storage engine
|
---|
256 | 2019-11-04T07:48:14.198+0000 I STORAGE [initandlisten] ** See http://dochub.mongodb.org/core/prodnotes-filesystem
|
---|
257 | 2019-11-04T07:48:14.624+0000 I CONTROL [initandlisten]
|
---|
258 | 2019-11-04T07:48:14.624+0000 I CONTROL [initandlisten] ** WARNING: Access control is not enabled for the database.
|
---|
259 | 2019-11-04T07:48:14.624+0000 I CONTROL [initandlisten] ** Read and write access to data and configuration is unrestricted.
|
---|
260 | 2019-11-04T07:48:14.624+0000 I CONTROL [initandlisten]
|
---|
261 | ---
|
---|
262 | Enable MongoDB's free cloud-based monitoring service, which will then receive and display
|
---|
263 | metrics about your deployment (disk utilization, CPU, operation statistics, etc).
|
---|
264 |
|
---|
265 | The monitoring data will be available on a MongoDB website with a unique URL accessible to you
|
---|
266 | and anyone you share the URL with. MongoDB may use this information to make product
|
---|
267 | improvements and to suggest MongoDB products and deployment options to you.
|
---|
268 |
|
---|
269 | To enable free monitoring, run the following command: db.enableFreeMonitoring()
|
---|
270 | To permanently disable this reminder, run the following command: db.disableFreeMonitoring()
|
---|
271 | ---
|
---|
272 |
|
---|
273 | > show dbs
|
---|
274 | admin 0.000GB
|
---|
275 | config 0.000GB
|
---|
276 | local 0.000GB
|
---|
277 | > use db ateacrawldata
|
---|
278 | 2019-11-05T05:24:20.155+0000 E QUERY [js] Error: [db ateacrawldata] is not a valid database name :
|
---|
279 | Mongo.prototype.getDB@src/mongo/shell/mongo.js:51:12
|
---|
280 | getDatabase@src/mongo/shell/session.js:913:28
|
---|
281 | DB.prototype.getSiblingDB@src/mongo/shell/db.js:22:12
|
---|
282 | shellHelper.use@src/mongo/shell/utils.js:803:10
|
---|
283 | shellHelper@src/mongo/shell/utils.js:790:15
|
---|
284 | @(shellhelp2):1:1
|
---|
285 | > db.createCollection('webpages');
|
---|
286 | { "ok" : 1 }
|
---|
287 | > db.webpages.drop();
|
---|
288 | ... ^C
|
---|
289 |
|
---|
290 | > db.webpages.drop();
|
---|
291 | true
|
---|
292 | > use ateacrawldata
|
---|
293 | switched to db ateacrawldata
|
---|
294 | > db.createCollection('webpages');
|
---|
295 | { "ok" : 1 }
|
---|
296 | > show collections
|
---|
297 | webpages
|
---|
298 | > db.createCollection('websites');
|
---|
299 | { "ok" : 1 }
|
---|
300 | >
|
---|
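The "use db ateacrawldata" error above arises because the shell takes everything after "use" as the database name, and a name containing a space is rejected. A rough sketch of the naming check in plain JavaScript, assuming the Unix/Linux restrictions listed in the MongoDB manual (the function name is hypothetical, and the length check counts characters rather than bytes):

```javascript
// Characters not allowed in database names on Unix/Linux deployments:
// / \ . space " $ and the null character.
const FORBIDDEN_DB_NAME_CHARS = /[\/\\. "$\0]/;

// Returns true if the name is non-empty, shorter than 64 characters
// and free of forbidden characters.
function isValidDbName(name) {
  return (
    name.length > 0 &&
    name.length < 64 &&
    !FORBIDDEN_DB_NAME_CHARS.test(name)
  );
}

console.log(isValidDbName("db ateacrawldata")); // false: contains a space
console.log(isValidDbName("ateacrawldata"));    // true
```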
301 |
|
---|
302 | ------------------------
|
---|
303 |
|
---|
304 | Ask Clint to rename "anupama" database to "ateacrawldata" database following the instructions at:
|
---|
305 | https://stackoverflow.com/questions/9201832/how-do-you-rename-a-mongodb-database
|
---|
306 | I don't have permissions to do this.
|
---|
307 | Nor do I have permissions to create Mongo collections within a new database that I create, like ateacrawldata.
|
---|
308 | I only seem to have rights to the "anupama" database.
|
---|
309 |
|
---|
310 |
|
---|
311 |
|
---|
312 | -----------------------
|
---|
313 | The Vagrant virtual machine node1 has mongodb installed.
|
---|
314 |
|
---|
315 | After doing "vagrant up" on node1 to start node1:
|
---|
316 |
|
---|
317 | [anupama@analytics vagrant-hadoop-hive-spark]$ vagrant ssh
|
---|
318 | vagrant@node1:~$ mongo
|
---|
319 | MongoDB shell version v4.2.1
|
---|
320 | connecting to: mongodb://127.0.0.1:27017/?compressors=disabled&gssapiServiceName=mongodb
|
---|
321 | 2019-11-13T09:22:46.996+0000 E QUERY [js] Error: couldn't connect to server 127.0.0.1:27017, connection attempt failed: SocketException: Error connecting to 127.0.0.1:27017 :: caused by :: Connection refused :
|
---|
322 | connect@src/mongo/shell/mongo.js:341:17
|
---|
323 | @(connect):2:6
|
---|
324 | 2019-11-13T09:22:46.999+0000 F - [main] exception: connect failed
|
---|
325 | 2019-11-13T09:22:46.999+0000 E - [main] exiting with code 1
|
---|
326 | vagrant@node1:~$ sudo service mongod status
|
---|
327 | ● mongod.service - MongoDB Database Server
|
---|
328 | Loaded: loaded (/lib/systemd/system/mongod.service; disabled; vendor preset: enabled)
|
---|
329 | Active: inactive (dead)
|
---|
330 | Docs: https://docs.mongodb.org/manual
|
---|
331 | vagrant@node1:~$ sudo service mongod start
|
---|
332 | vagrant@node1:~$ sudo service mongod status
|
---|
333 | ● mongod.service - MongoDB Database Server
|
---|
334 | Loaded: loaded (/lib/systemd/system/mongod.service; disabled; vendor preset: enabled)
|
---|
335 | Active: active (running) since Wed 2019-11-13 09:24:07 UTC; 2s ago
|
---|
336 | Docs: https://docs.mongodb.org/manual
|
---|
337 | Main PID: 4383 (mongod)
|
---|
338 | Tasks: 32
|
---|
339 | Memory: 199.3M
|
---|
340 | CPU: 754ms
|
---|
341 | CGroup: /system.slice/mongod.service
|
---|
342 | └─4383 /usr/bin/mongod --config /etc/mongod.conf
|
---|
343 |
|
---|
344 | Nov 13 09:24:07 node1 systemd[1]: Started MongoDB Database Server.
|
---|
345 | vagrant@node1:~$
|
---|
346 |
|
---|
347 |
|
---|
348 | So now mongodb is running on node1 on localhost:27017.
|
---|
349 |
|
---|
350 | Next, in another x-term on analytics, open an ssh tunnel into the node1 Vagrant VM so that node1's localhost:27017 is reachable at analytics' localhost:27017:
|
---|
351 | vagrant ssh -- -L 27017:localhost:27017
|
---|
352 |
|
---|
353 |
|
---|
354 |
|
---|
355 | Finally, in another x-term on the current machine, open a second tunnel so that analytics' localhost:27017 is reachable at the current machine's localhost:27017:
|
---|
356 | ssh -L 27017:localhost:27017 analytics
|
---|
357 |
|
---|
358 |
|
---|
359 | Robo 3T running on the current machine can now connect to localhost:27017.
|
---|
360 |
|
---|
361 | Then, in a new x-term, the mongo client shell can also connect (by default it connects to localhost:27017):
|
---|
362 |
|
---|
363 | wharariki:[122]/Scratch/ak19/GS309>mongo --shell
|
---|
364 | MongoDB shell version v4.0.13
|
---|
365 | connecting to: mongodb://127.0.0.1:27017/?gssapiServiceName=mongodb
|
---|
366 | ...
|
---|
367 | > show dbs
|
---|
368 | admin 0.000GB
|
---|
369 | ateacrawldata 1.532GB
|
---|
370 | config 0.000GB
|
---|
371 | local 0.000GB
|
---|
372 | > use ateacrawldata
|
---|
373 |
|
---|
374 | > show collections
|
---|
375 | Webpages
|
---|
376 | Websites
|
---|
377 | oldwebpages
|
---|
378 | oldwebsites
|
---|
379 | -------------------
|
---|
380 |
|
---|
381 | Country code to geolocation CSV file found by Dr Bainbridge:
|
---|
382 | https://developers.google.com/public-data/docs/canonical/countries_csv
|
---|
383 |
|
---|
384 | Import into mongodb with:
|
---|
385 | https://stackoverflow.com/questions/4686500/how-to-use-mongoimport-to-import-csv
|
---|
386 |
|
---|
387 |
|
---|
388 |
|
---|
389 | NOTE: mongoimport is a command-line utility, not a command to run inside the mongo shell. See https://jira.mongodb.org/browse/DOCS-11072
|
---|
390 | This means: DON'T start the mongo shell/client first. Instead, run the following directly from an x-term to import the countrycodes.csv file:
|
---|
391 |
|
---|
392 |
|
---|
393 | mongoimport -d ateacrawldata -c countrylocations --type csv --file /Scratch/ak19/maori-lang-detection/MoreReading/countrycodes.csv --headerline
|
---|
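What --headerline does in the mongoimport command above, roughly sketched in plain JavaScript: the first CSV line supplies the field names and every later line becomes one document. This is a simplification (no quoted fields, all values kept as strings), the function name is hypothetical, and the sample rows are made up for illustration; the real countrycodes.csv may differ.

```javascript
// Turn CSV text into an array of documents, using the first line as the
// field names, which is the role --headerline plays for mongoimport.
// Simplification: quoted fields and embedded commas are not handled.
function csvToDocs(csvText) {
  const lines = csvText.trim().split(/\r?\n/);
  const fieldNames = lines[0].split(",");
  return lines.slice(1).map((line) => {
    const values = line.split(",");
    const doc = {};
    fieldNames.forEach((name, i) => {
      doc[name] = values[i];
    });
    return doc;
  });
}

// Hypothetical sample data, not taken from the real file.
const sampleCsv =
  "country,latitude,longitude,name\n" +
  "NZ,-40.9,174.9,New Zealand\n" +
  "AU,-25.3,133.8,Australia";

console.log(csvToDocs(sampleCsv));
```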
394 |
|
---|
395 |
|
---|
396 | -------------------------
|
---|
397 |
|
---|
398 | MONGODB QUERIES:
|
---|
399 |
|
---|
400 | db.getCollection('webpages').find({"isMRI": true, "singleSentences.langCode": "mri"})
|
---|
401 | db.getCollection('webpages').find({"singleSentences": { $elemMatch: {"langCode":"mri"} } }, {"singleSentences.$": 1})
|
---|
402 | db.getCollection('Webpages').find({"isMRI": true, "singleSentences": { $elemMatch: {"langCode":"eng"} } }, {"singleSentences.$": 1}) [gets the first English sentence of each doc that is overall in MRI]
|
---|
403 | db.getCollection('Webpages').find({"containsMRI": true, "singleSentences": { $elemMatch: {"langCode":"mri"} } }, {"singleSentences.$": 1}) [gets the first MRI sentence of each doc that has sentences detected as MRI]
|
---|
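A sketch of what the $elemMatch queries with the positional "singleSentences.$" projection above return, emulated on an in-memory array in plain JavaScript: only documents with a matching sentence come back, and each carries just its first matching array element. The function name and the "text" field in the sample data are hypothetical; only singleSentences and langCode come from the queries above.

```javascript
// Emulates find({singleSentences: {$elemMatch: {langCode: code}}},
//               {"singleSentences.$": 1}) on an in-memory array.
function firstMatchingSentence(pages, code) {
  return pages
    .filter((page) =>
      (page.singleSentences || []).some((s) => s.langCode === code)
    )
    .map((page) => ({
      _id: page._id,
      // The positional projection keeps only the first matching element.
      singleSentences: [
        page.singleSentences.find((s) => s.langCode === code),
      ],
    }));
}

// Hypothetical sample documents.
const samplePages = [
  {
    _id: 1,
    singleSentences: [
      { langCode: "eng", text: "Hello" },
      { langCode: "mri", text: "Kia ora" },
      { langCode: "mri", text: "Tena koe" },
    ],
  },
  { _id: 2, singleSentences: [{ langCode: "eng", text: "Goodbye" }] },
];

console.log(JSON.stringify(firstMatchingSentence(samplePages, "mri")));
```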
404 |
|
---|
405 |
|
---|
406 | READING
|
---|
407 |
|
---|
408 | mongodb java convert class
|
---|
409 | https://www.quora.com/What-are-the-ways-of-converting-a-Java-object-to-a-MongoDB-document-and-vice-versa
|
---|
410 | https://stackoverflow.com/questions/39320825/pojo-to-org-bson-document-and-vice-versa
|
---|
411 | X https://mongodb.github.io/morphia/
|
---|
412 | https://stackoverflow.com/questions/10170506/inserting-java-object-to-mongodb-collection-using-java
|
---|
413 | X https://www.google.com/search?q=morphia+example&oq=morphia+example&aqs=chrome.0.0l6.4223j0j9&sourceid=chrome&ie=UTF-8
|
---|
414 | https://www.baeldung.com/mongodb-morphia
|
---|
415 | X https://web.archive.org/web/20171117121335/http://mongodb.github.io/morphia/1.3/getting-started/
|
---|
416 | => https://morphia.dev/1.4/getting-started/quick-tour/
|
---|
417 | https://github.com/MorphiaOrg/morphia/tree/master/docs/reference
|
---|
418 |
|
---|
419 |
|
---|
420 | mongodb querying
|
---|
421 | https://docs.mongodb.com/manual/tutorial/query-embedded-documents/
|
---|
422 | https://docs.mongodb.com/manual/tutorial/query-arrays/
|
---|
423 | https://www.google.com/search?q=mongodb+find+subdocument&oq=mongodb+find+&aqs=chrome.0.69i59j69i57j0l4.7607j1j8&sourceid=chrome&ie=UTF-8
|
---|
424 | https://stackoverflow.com/questions/25586901/how-to-find-document-and-single-subdocument-matching-given-criterias-in-mongodb
|
---|
425 | https://stackoverflow.com/questions/21113543/mongodb-get-subdocument
|
---|
426 | https://stackoverflow.com/questions/36948856/find-subdocuments-in-mongo
|
---|
427 | https://docs.mongodb.com/v3.0/reference/operator/projection/positional/#proj._S_
|
---|
428 | https://www.google.com/search?q=mongodb+query+tutorial&oq=mongodb+query+tutorial&aqs=chrome..69i57j0l2j69i60l3.4719j0j7&sourceid=chrome&ie=UTF-8
|
---|
429 | https://blog.exploratory.io/an-introduction-to-mongodb-query-for-beginners-bd463319aa4c
|
---|
430 | https://docs.mongodb.com/manual/reference/method/db.collection.find/
|
---|
431 | https://docs.mongodb.com/manual/reference/method/db.collection.find/#find-projection
|
---|
432 | https://stackoverflow.com/questions/39641925/mongodb-aggregation-framework-to-get-frequencies-of-fields-values
|
---|
433 |
|
---|
434 | https://exploratory.io/note/kanaugust/0961813761939766
|
---|
435 | https://docs.mongodb.com/manual/tutorial/project-fields-from-query-results/
|
---|
436 | https://docs.mongodb.com/manual/aggregation/
|
---|
437 |
|
---|
438 |
|
---|
439 | Mongo Studio 3T documentation:
|
---|
440 | https://studio3t.com/download/ (also has uninstall information)
|
---|
441 | https://studio3t.com/download-thank-you/?OS=x64
|
---|
442 |
|
---|
443 | Google: MongoDB visualization
|
---|
444 | MongoDB visualization map
|
---|
445 | MongoDB Charts
|
---|
446 | (Open source visualisation tools)
|
---|
447 |
|
---|
448 | json map visualizer
|
---|
449 | geojson.tools
|
---|
450 | -------------------
|
---|
451 |
|
---|
452 | Some queries with results:
|
---|
453 |
|
---|
454 | # Num websites
|
---|
455 | db.getCollection('Websites').find({}).count()
|
---|
456 | 1445
|
---|
457 |
|
---|
458 | # Num webpages
|
---|
459 | db.getCollection('Webpages').find({}).count()
|
---|
460 | X75139
|
---|
461 | 117496
|
---|
462 |
|
---|
463 | # Find number of websites that have 1 or more pages detected as being in Maori (a positive numPagesInMRI)
|
---|
464 | db.getCollection('Websites').find({numPagesInMRI: { $gt: 0}}).count()
|
---|
465 | 361
|
---|
466 |
|
---|
467 | # Number of sites containing at least one sentence for which OpenNLP detected the best language = MRI
|
---|
468 | db.getCollection('Websites').find({numPagesContainingMRI: {$gt: 0}}).count()
|
---|
469 | 868
|
---|
470 |
|
---|
471 | # The union of the above two has the same count as numPagesContainingMRI alone, since a site with pages in MRI also has pages containing MRI:
|
---|
472 | db.getCollection('Websites').find({ $or: [ { numPagesInMRI: { $gt: 0 } }, { numPagesContainingMRI: {$gt: 0} } ] } ).count()
|
---|
473 | 868
|
---|
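Why the $or count above equals the numPagesContainingMRI count, as a small plain-JavaScript sketch: if every site with numPagesInMRI > 0 also has numPagesContainingMRI > 0, the first set is contained in the second, so the union is no larger than the second set. The sample sites are made up and the helper name is hypothetical.

```javascript
// Count the documents for which the predicate holds, like find(...).count().
function countWhere(docs, predicate) {
  return docs.filter(predicate).length;
}

// Hypothetical sample sites; only the two field names come from the notes.
const sampleSites = [
  { numPagesInMRI: 2, numPagesContainingMRI: 5 },
  { numPagesInMRI: 0, numPagesContainingMRI: 1 },
  { numPagesInMRI: 0, numPagesContainingMRI: 0 },
];

const hasPagesInMRI = (s) => s.numPagesInMRI > 0;
const hasPagesContainingMRI = (s) => s.numPagesContainingMRI > 0;

const inCount = countWhere(sampleSites, hasPagesInMRI);
const containingCount = countWhere(sampleSites, hasPagesContainingMRI);
const unionCount = countWhere(
  sampleSites,
  (s) => hasPagesInMRI(s) || hasPagesContainingMRI(s)
);

console.log(inCount, containingCount, unionCount); // 1 2 2
```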
474 |
|
---|
475 | # Find number of webpages that are deemed to be overall in MRI (pages where isMRI=true)
|
---|
476 | db.getCollection('Webpages').find({isMRI:true}).count()
|
---|
477 | X5224
|
---|
478 | X5215
|
---|
479 | db.getCollection('Webpages').find({isMRI:true}).count()
|
---|
480 | 7818
|
---|
481 |
|
---|
482 | # Number of pages that contain any number of MRI sentences
|
---|
483 | db.getCollection('Webpages').find({containsMRI: true}).count()
|
---|
484 | X12858
|
---|
485 | 20371
|
---|
486 |
|
---|
487 |
|
---|
488 | # Number of sites with URLs containing /mi(/)
|
---|
489 | db.getCollection('Websites').find({urlContainsLangCodeInPath:true}).count()
|
---|
490 | X 153
|
---|
491 | # Number of sites with URLs containing /mi(/) OR http(s)://mi.*
|
---|
492 | db.getCollection('Websites').find({urlContainsLangCodeInPath:true}).count()
|
---|
493 | 670
|
---|
494 |
|
---|
495 | # Number of websites outside NZ that contain /mi(/) in any of their sub-urls
|
---|
496 | db.getCollection('Websites').find({urlContainsLangCodeInPath:true, geoLocationCountryCode: {$ne : "NZ"} }).count()
|
---|
497 | X 147
|
---|
498 | # Number of websites outside NZ that contain /mi(/) OR http(s)://mi.* in any of their sub-urls
|
---|
499 | db.getCollection('Websites').find({urlContainsLangCodeInPath:true, geoLocationCountryCode: {$ne : "NZ"} }).count()
|
---|
500 | 656
|
---|
501 |
|
---|
502 | # 6 sites with URLs containing /mi(/) that are in NZ
|
---|
503 | db.getCollection('Websites').find({urlContainsLangCodeInPath:true, geoLocationCountryCode: "NZ"}).count()
|
---|
504 | X 6
|
---|
505 | # 14 sites with URLs containing /mi(/) OR http(s)://mi.* that are in NZ
|
---|
506 | 14
|
---|
507 |
|
---|
508 |
|
---|
509 | # sort websites that contain /mi(/) in path by geoLocationCountryCode
|
---|
510 | # https://www.quackit.com/mongodb/tutorial/mongodb_sort_query_results.cfm
|
---|
511 | db.getCollection('Websites').find({urlContainsLangCodeInPath:true}).sort({geoLocationCountryCode: 1})
|
---|
512 |
|
---|
513 | Actually, I want to sort by count. See https://docs.mongodb.com/manual/reference/operator/aggregation/sortByCount/
|
---|
514 |
|
---|
515 |
|
---|
516 | # PROJECTION:
|
---|
517 | db.getCollection('Websites').find({geoLocationCountryCode: {$ne:"NZ"}}, {geoLocationCountryCode:1, urlContainsLangCodeInPath: 1})
|
---|
518 |
|
---|
519 | https://docs.mongodb.com/manual/aggregation/
|
---|
520 | EXAMPLE:
|
---|
521 | db.orders.aggregate([
|
---|
522 | { $match: { status: "A" } },
|
---|
523 | { $group: { _id: "$cust_id", total: { $sum: "$amount" } } }
|
---|
524 | ])
|
---|
525 |
|
---|
526 | X db.Websites.aggregate([{ $match:{urlContainsLangCodeInPath:true}}, $group: {geoLocationCountryCode:1, total: $count}])
|
---|
527 |
|
---|
528 |
|
---|
529 | X db.Websites.aggregate([
|
---|
530 | { $match:{urlContainsLangCodeInPath:true}},
|
---|
531 | {$group: {geoLocationCountryCode:1}}
|
---|
532 | ])
|
---|
533 |
|
---|
534 | WORKS (but adding an "$unwind" stage will get rid of the "null" group):
|
---|
535 | db.Websites.aggregate([
|
---|
536 | { $match:{urlContainsLangCodeInPath:true}},
|
---|
537 | {$group: {_id: "$geoLocationCountryCode", count: {$sum: 1}}},
|
---|
538 | { $sort : { count : -1} }
|
---|
539 | ])
|
---|
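The $match / $group / $sort pipeline above, emulated on an in-memory array in plain JavaScript as a rough sketch of what each stage contributes. Documents lacking the grouped field end up under a null _id, which is the "null" group mentioned above. The function name and sample data are hypothetical.

```javascript
// Emulates: [ {$match: ...}, {$group: {_id: "$field", count: {$sum: 1}}},
//             {$sort: {count: -1}} ]
function countByField(docs, matchFn, field) {
  const counts = new Map();
  for (const doc of docs.filter(matchFn)) {
    // A missing field is grouped under null, as $group does.
    const key = doc[field] === undefined ? null : doc[field];
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return [...counts.entries()]
    .map(([_id, count]) => ({ _id, count }))
    .sort((a, b) => b.count - a.count); // descending by count
}

// Hypothetical sample sites.
const groupSample = [
  { urlContainsLangCodeInPath: true, geoLocationCountryCode: "US" },
  { urlContainsLangCodeInPath: true, geoLocationCountryCode: "US" },
  { urlContainsLangCodeInPath: true, geoLocationCountryCode: "NZ" },
  { urlContainsLangCodeInPath: true },
  { urlContainsLangCodeInPath: false, geoLocationCountryCode: "US" },
];

console.log(
  countByField(
    groupSample,
    (d) => d.urlContainsLangCodeInPath === true,
    "geoLocationCountryCode"
  )
);
```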
540 |
|
---|
541 |
|
---|
542 | # COUNT OF ALL GEOLOCATION COUNTRIES
|
---|
543 | #https://stackoverflow.com/questions/14924495/mongodb-count-num-of-distinct-values-per-field-key
|
---|
544 | # LIST
|
---|
545 | db.Websites.distinct('geoLocationCountryCode');
|
---|
546 |
|
---|
547 | # COUNT
|
---|
548 | db.Websites.distinct('geoLocationCountryCode').length;
|
---|
549 |
|
---|
550 | # DISTINCT WITH QUERY VIA runCommand (the result's values array holds the distinct values; its length gives the COUNT) - https://docs.mongodb.com/manual/reference/command/distinct/#dbcmd.distinct
|
---|
551 |
|
---|
552 | db.runCommand ( { distinct: "Websites", key: "geoLocationCountryCode", query: { "urlContainsLangCodeInPath": true} } );
|
---|
553 |
|
---|
554 | # DISTINCT WITH QUERY WITHOUT COUNT - https://docs.mongodb.com/manual/reference/method/db.collection.distinct/
|
---|
555 | db.Websites.distinct('geoLocationCountryCode', {"urlContainsLangCodeInPath": true});
|
---|
556 |
|
---|
557 | #SORTED - https://stackoverflow.com/questions/4759437/get-distinct-values-with-sorted-data
|
---|
558 | db.Websites.distinct('geoLocationCountryCode', {"urlContainsLangCodeInPath": true}).sort();
|
---|
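In the mongo shell, db.collection.distinct() returns a plain JavaScript array, so the .length and .sort() calls above are ordinary array operations applied on the client side. A rough in-memory equivalent in plain JavaScript, assuming a simple equality-only query (the function name and sample data are hypothetical):

```javascript
// Emulates db.Websites.distinct(field, query).sort() for queries made
// only of simple equality conditions.
function distinctValues(docs, field, query) {
  const seen = new Set();
  for (const doc of docs) {
    const matches = Object.keys(query).every((k) => doc[k] === query[k]);
    if (matches && doc[field] !== undefined) {
      seen.add(doc[field]);
    }
  }
  return [...seen].sort();
}

// Hypothetical sample sites.
const distinctSample = [
  { geoLocationCountryCode: "US", urlContainsLangCodeInPath: true },
  { geoLocationCountryCode: "NZ", urlContainsLangCodeInPath: true },
  { geoLocationCountryCode: "US", urlContainsLangCodeInPath: true },
  { geoLocationCountryCode: "AU", urlContainsLangCodeInPath: false },
];

const codes = distinctValues(distinctSample, "geoLocationCountryCode", {
  urlContainsLangCodeInPath: true,
});
console.log(codes, codes.length); // [ 'NZ', 'US' ] 2
```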
559 |
|
---|
560 |
|
---|
561 | # count of all sites for which the geolocation is UNKNOWN
|
---|
562 | db.getCollection('Websites').find({geoLocationCountryCode: {$eq:"UNKNOWN"}}).count()
|
---|
563 |
|
---|
564 |
|
---|
565 | # AGGREGATION QUERIES THAT WORK:
|
---|
566 | #https://stackoverflow.com/questions/14924495/mongodb-count-num-of-distinct-values-per-field-key
|
---|
567 |
|
---|
568 | WORKS:
|
---|
569 | // count of country codes for all sites
|
---|
570 | db.Websites.aggregate([
|
---|
571 |
|
---|
572 | { $unwind: "$geoLocationCountryCode" },
|
---|
573 | {
|
---|
574 | $group: {
|
---|
575 | _id: "$geoLocationCountryCode",
|
---|
576 | count: { $sum: 1 }
|
---|
577 | }
|
---|
578 | },
|
---|
579 | { $sort : { count : -1} }
|
---|
580 | ]);
|
---|
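How the $unwind stage above removes the null group, sketched in plain JavaScript: with default options, $unwind drops documents whose field is missing, null or an empty array, and treats a non-array value as a one-element array. The function name and sample data are hypothetical.

```javascript
// Emulates { $unwind: "$field" } with default options.
function unwind(docs, field) {
  const out = [];
  for (const doc of docs) {
    const value = doc[field];
    if (value === undefined || value === null) {
      continue; // missing or null field: the document is dropped
    }
    // A non-array value is treated as a single-element array.
    const items = Array.isArray(value) ? value : [value];
    for (const item of items) {
      out.push({ ...doc, [field]: item });
    }
  }
  return out;
}

// Hypothetical sample documents.
const unwindSample = [
  { _id: 1, geoLocationCountryCode: "NZ" },
  { _id: 2, geoLocationCountryCode: null },
  { _id: 3 },
  { _id: 4, geoLocationCountryCode: ["US", "AU"] },
];

console.log(unwind(unwindSample, "geoLocationCountryCode"));
```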
581 |
|
---|
582 | // count of country codes for sites that have at least one page detected as MRI
|
---|
583 |
|
---|
584 | db.Websites.aggregate([
|
---|
585 | {
|
---|
586 | $match: {
|
---|
587 | numPagesInMRI: {$gt: 0}
|
---|
588 | }
|
---|
589 | },
|
---|
590 | { $unwind: "$geoLocationCountryCode" },
|
---|
591 | {
|
---|
592 | $group: {
|
---|
593 | _id: {$toLower: '$geoLocationCountryCode'},
|
---|
594 | count: { $sum: 1 }
|
---|
595 | }
|
---|
596 | },
|
---|
597 | { $sort : { count : -1} }
|
---|
598 | ]);
|
---|
599 |
|
---|
600 | // count of country codes for sites that have at least one page containing at least one sentence detected as MRI
|
---|
601 | db.Websites.aggregate([
|
---|
602 | {
|
---|
603 | $match: {
|
---|
604 | numPagesContainingMRI: {$gt: 0}
|
---|
605 | }
|
---|
606 | },
|
---|
607 | { $unwind: "$geoLocationCountryCode" },
|
---|
608 | {
|
---|
609 | $group: {
|
---|
610 | _id: {$toLower: '$geoLocationCountryCode'},
|
---|
611 | count: { $sum: 1 }
|
---|
612 | }
|
---|
613 | },
|
---|
614 | { $sort : { count : -1} }
|
---|
615 | ]);
|
---|
616 |
|
---|
617 |
|
---|
618 | WORKS:
|
---|
619 | // count of country codes for sites that have /mi(/) or http(s)://mi.* in URL path
|
---|
620 |
|
---|
621 | db.Websites.aggregate([
|
---|
622 | {
|
---|
623 | $match: {
|
---|
624 | urlContainsLangCodeInPath: true
|
---|
625 | }
|
---|
626 | },
|
---|
627 | { $unwind: "$geoLocationCountryCode" },
|
---|
628 | {
|
---|
629 | $group: {
|
---|
630 | _id: {$toLower: '$geoLocationCountryCode'},
|
---|
631 | count: { $sum: 1 }
|
---|
632 | }
|
---|
633 | },
|
---|
634 | { $sort : { count : -1} }
|
---|
635 | ]);
|
---|
636 |
|
---|
637 |
|
---|
638 | WORKS (count of country codes for all sites whose geolocation is not UNKNOWN):
|
---|
639 | db.Websites.aggregate([
|
---|
640 | {
|
---|
641 | $match: {
|
---|
642 | geoLocationCountryCode: {$ne : "UNKNOWN"}
|
---|
643 | }
|
---|
644 | },
|
---|
645 | { $unwind: "$geoLocationCountryCode" },
|
---|
646 | {
|
---|
647 | $group: {
|
---|
648 | _id: "$geoLocationCountryCode",
|
---|
649 | count: { $sum: 1 }
|
---|
650 | }
|
---|
651 | },
|
---|
652 | { $sort : { count : -1} }
|
---|
653 | ]);
|
---|
654 |
|
---|
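655 | WORKS (count of country codes for sites with the language code in the URL path, country codes kept in original case):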
|
---|
656 | db.Websites.aggregate([
|
---|
657 | {
|
---|
658 | $match: {
|
---|
659 | "urlContainsLangCodeInPath": true
|
---|
660 | }
|
---|
661 | },
|
---|
662 | { $unwind: "$geoLocationCountryCode" },
|
---|
663 | {
|
---|
664 | $group: {
|
---|
665 | _id: "$geoLocationCountryCode",
|
---|
666 | count: { $sum: 1 }
|
---|
667 | }
|
---|
668 | },
|
---|
669 | { $sort : { count : -1} }
|
---|
670 | ]);
|
---|
671 |
|
---|
672 |
|
---|
673 | KEEP ADDITIONAL FIELDS - https://stackoverflow.com/questions/16662405/mongo-group-query-how-to-keep-fields:
|
---|
674 | a. KEEPS ONLY FIRST DOMAIN URL FOR EACH COUNTED COUNTRY CODE:
|
---|
675 |
|
---|
676 | db.Websites.aggregate([
|
---|
677 | {
|
---|
678 | $match: {
|
---|
679 | "urlContainsLangCodeInPath": true
|
---|
680 | }
|
---|
681 | },
|
---|
682 | { $unwind: "$geoLocationCountryCode" },
|
---|
683 | {
|
---|
684 | $group: {
|
---|
685 | _id: "$geoLocationCountryCode", count: { $sum: 1 },
|
---|
686 | domain: {$first: '$domain'}
|
---|
687 | }
|
---|
688 | },
|
---|
689 | { $sort : { count : -1} }
|
---|
690 | ]);
|
---|
691 |
|
---|
692 | b. KEEP ALL DOMAIN URLS:
|
---|
693 | db.Websites.aggregate([
|
---|
694 | {
|
---|
695 | $match: {
|
---|
696 | "urlContainsLangCodeInPath": true
|
---|
697 | }
|
---|
698 | },
|
---|
699 | { $unwind: "$geoLocationCountryCode" },
|
---|
700 | {
|
---|
701 | $group: {
|
---|
702 | _id: "$geoLocationCountryCode",
|
---|
703 | count: { $sum: 1 },
|
---|
704 | domain: { $addToSet: '$domain' }
|
---|
705 | }
|
---|
706 | },
|
---|
707 | { $sort : { count : -1} }
|
---|
708 | ]);
|
---|
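Difference between variants a. and b., sketched in plain JavaScript. The sample documents and the helper groupWithDomains are hypothetical. $first keeps one domain per group (whichever document reaches the group first, which is not a meaningful choice without a preceding $sort), while $addToSet keeps every distinct domain and collapses repeats:

```javascript
// Hypothetical documents; only the field names are taken from these notes.
const matchedSites = [
  { domain: "http://one.example", geoLocationCountryCode: "NZ" },
  { domain: "http://two.example", geoLocationCountryCode: "NZ" },
  { domain: "http://one.example", geoLocationCountryCode: "NZ" }, // repeated domain
  { domain: "http://three.example", geoLocationCountryCode: "AU" },
];

function groupWithDomains(sites) {
  const groups = new Map();
  for (const { geoLocationCountryCode: code, domain } of sites) {
    if (!groups.has(code)) {
      // firstDomain plays the role of {$first: '$domain'}
      groups.set(code, { _id: code, count: 0, firstDomain: domain, domains: new Set() });
    }
    const group = groups.get(code);
    group.count += 1;          // {$sum: 1}
    group.domains.add(domain); // {$addToSet: '$domain'}: duplicates collapse
  }
  return [...groups.values()]
    .map((g) => ({ ...g, domains: [...g.domains] }))
    .sort((a, b) => b.count - a.count);
}

const grouped = groupWithDomains(matchedSites);
console.log(JSON.stringify(grouped, null, 2));
```

So with $addToSet the count can be larger than the length of the domain array if the same domain occurs in more than one document.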
709 |
|
---|
710 |
|
---|
711 | # WANT TO GET THE ABOVE INTO WORLD MAP, use geojson.tools found by Dr Bainbridge
|
---|
712 | geojson.tools
|
---|
713 | USAGE: https://www.here.xyz/viewer-tool/
|
---|
714 |
|
---|
715 |
|
---|
716 | AIMS:
|
---|
717 | * Identify where Maori language is online.
|
---|
718 | * How can we identify high-quality sites that would be good for a corpus?
|
---|
719 | (Related work for other languages to quantifiably answer that)
|
---|
720 |
|
---|
721 | data-preparation
|
---|
722 | docs
|
---|
723 |
|
---|
724 |
|
---|
725 | ------------------------------------------
|
---|
726 |
|
---|
727 | BUILDING TOWARDS NEW MONGODB QUERY: Counts by country code of TENTATIVE NON-PRODUCT SITES that are in Maori
|
---|
728 | ---
|
---|
729 |
|
---|
730 | # https://stackoverflow.com/questions/16902930/mongodb-aggregation-framework-match-or
|
---|
731 | # https://docs.mongodb.com/manual/reference/operator/query/and/
|
---|
732 |
|
---|
733 | # 1. all the websites which are from NZ:
|
---|
734 | db.getCollection('Websites').find({geoLocationCountryCode: "NZ"}).count()
|
---|
735 | 128
|
---|
736 |
|
---|
737 | # 2. all the websites that have /mi in URL path which are from NZ:
|
---|
738 | db.getCollection('Websites').find({$and: [{urlContainsLangCodeInPath: true}, {geoLocationCountryCode: "NZ"}]}).count()
|
---|
739 | 6
|
---|
740 |
|
---|
741 | # 3. all the websites that don't have /mi in URLpath
|
---|
742 | db.getCollection('Websites').find({urlContainsLangCodeInPath: false}).count()
|
---|
743 | 1292
|
---|
744 |
|
---|
745 | # 4. all the websites that don't have /mi in the URL path, or that do but are from NZ
|
---|
746 | # (should be the sum of points 2 and 3 above)
|
---|
747 | db.getCollection('Websites').find({$or: [{urlContainsLangCodeInPath: false}, {$and: [{urlContainsLangCodeInPath: true}, {geoLocationCountryCode: "NZ"}]}]}).count()
|
---|
748 | 1298
|
---|
749 |
|
---|
750 | # 5. All the websites that have at least 1 page containing a sentence detected as MRI AND either don't have /mi in URL path or, if they do, are from NZ
|
---|
751 | # These are the TENTATIVE NON-PRODUCT SITES
|
---|
752 | # Should be less than point 4, since it adds a further condition to that query
|
---|
753 |
|
---|
754 | db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{$or: [{urlContainsLangCodeInPath: false}, {$and: [{urlContainsLangCodeInPath: true}, {geoLocationCountryCode: "NZ"}]}]}]}).count()
|
---|
755 | X 859
|
---|
756 |
|
---|
757 | Now with http(s)://mi.* also excluded, the above query returns a count of:
|
---|
758 | 389
|
---|
759 |
|
---|
760 |
|
---|
761 | BUT THIS IS THE CORRECT VERSION OF THE QUERY:
|
---|
762 | db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{$or: [{geoLocationCountryCode: "NZ"}, {urlContainsLangCodeInPath: false}]}]}).count()
|
---|
763 | 389
|
---|
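Why the long and the short form of the $or can return the same count: for a strictly boolean urlContainsLangCodeInPath they are logically equivalent. A small truth-table sketch in plain JavaScript (the names longForm and shortForm are just labels for this illustration). It also covers the case where the field is absent, modelled here as undefined, on the understanding that {field: false} and {field: true} are equality matches that do not match a document lacking the field:

```javascript
// mi: value of urlContainsLangCodeInPath (true, false, or undefined when the field is absent)
// nz: whether geoLocationCountryCode is "NZ"
const longForm = (mi, nz) => mi === false || (mi === true && nz);  // $or: [mi false, $and: [mi true, NZ]]
const shortForm = (mi, nz) => nz || mi === false;                  // $or: [NZ, mi false]

const rows = [];
for (const mi of [true, false, undefined]) {
  for (const nz of [true, false]) {
    rows.push({ mi, nz, longForm: longForm(mi, nz), shortForm: shortForm(mi, nz) });
  }
}
console.table(rows);

// Rows where the two forms disagree.
const disagreements = rows.filter((r) => r.longForm !== r.shortForm);
console.log(disagreements);
```

The only disagreement is an NZ site with no urlContainsLangCodeInPath field at all: the short form keeps it, the long form drops it. Identical counts from both queries would suggest there were no such documents at the time. Later queries in these notes use {$ne: true}, which also matches documents lacking the field.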
764 |
|
---|
765 |
|
---|
766 | # 6. Now do the counts by country code of the above, by pasting the query of point 5 as the $match clause (i.e. without the .count() suffix)
|
---|
767 | # Counts by country code of TENTATIVE NON-PRODUCT SITES that are in Maori
|
---|
768 | db.Websites.aggregate([
|
---|
769 | {
|
---|
770 | $match: {$and: [{numPagesContainingMRI: {$gt: 0}},{$or: [{urlContainsLangCodeInPath: false}, {$and: [{urlContainsLangCodeInPath: true}, {geoLocationCountryCode: "NZ"}]}]}]}
|
---|
771 | },
|
---|
772 | { $unwind: "$geoLocationCountryCode" },
|
---|
773 | {
|
---|
774 | $group: {
|
---|
775 | _id: {$toLower: '$geoLocationCountryCode'},
|
---|
776 | count: { $sum: 1 }
|
---|
777 | }
|
---|
778 | },
|
---|
779 | { $sort : { count : -1} }
|
---|
780 | ]);
|
---|
781 |
|
---|
782 | The result is very close to the same aggregate on just numPagesContainingMRI.
|
---|
783 |
|
---|
784 | That's because very few websites both contain /mi/ in the URL path AND have numPagesContainingMRI > 0:
|
---|
785 |
|
---|
786 | db.Websites.aggregate([
|
---|
787 | {
|
---|
788 | $match: {
|
---|
789 | $and: [{numPagesContainingMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}]
|
---|
790 | }
|
---|
791 | },
|
---|
792 | { $unwind: "$geoLocationCountryCode" },
|
---|
793 | {
|
---|
794 | $group: {
|
---|
795 | _id: {$toLower: '$geoLocationCountryCode'},
|
---|
796 | count: { $sum: 1 }
|
---|
797 | }
|
---|
798 | },
|
---|
799 | { $sort : { count : -1} }
|
---|
800 | ]);
|
---|
801 |
|
---|
802 |
|
---|
803 | _id count
|
---|
804 | us 4.0
|
---|
805 | nz 4.0
|
---|
806 | au 3.0
|
---|
807 | ru 1.0
|
---|
808 | de 1.0
|
---|
809 |
|
---|
810 | Total: 13 sites that have /mi/ in the URL path and are detected as having MRI content:
|
---|
811 | db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}]}).count()
|
---|
812 | 13
|
---|
813 |
|
---|
814 | Of these 13, the 4 from NZ were already included in steps 5 and 6. So the difference is only 9 sites (13 - 4), all from outside NZ with /mi/ in the URL path.
|
---|
815 |
|
---|
816 |
|
---|
817 | Let's get a listing of the sites' domains - 3 whose country codes are NOT NZ have NZ TLD!
|
---|
818 | /* 1 */
|
---|
819 | {
|
---|
820 | "_id" : "nz",
|
---|
821 | "count" : 4.0,
|
---|
822 | "domain" : [
|
---|
823 | "http://firstworldwar.tki.org.nz",
|
---|
824 | "http://www.firstworldwar.tki.org.nz",
|
---|
825 | "https://admin.teara.govt.nz",
|
---|
826 | "http://community.nzdl.org"
|
---|
827 | ]
|
---|
828 | }
|
---|
829 |
|
---|
830 | /* 2 */
|
---|
831 | {
|
---|
832 | "_id" : "us",
|
---|
833 | "count" : 4.0,
|
---|
834 | "domain" : [
|
---|
835 | "https://sexualviolence.victimsinfo.govt.nz",
|
---|
836 | "https://follow3rs.com",
|
---|
837 | "http://www.church-of-christ.org",
|
---|
838 | "http://www.mytrickstips.com"
|
---|
839 | ]
|
---|
840 | }
|
---|
841 |
|
---|
842 | /* 3 */
|
---|
843 | {
|
---|
844 | "_id" : "au",
|
---|
845 | "count" : 3.0,
|
---|
846 | "domain" : [
|
---|
847 | "https://rapuatearatika.education.govt.nz",
|
---|
848 | "https://www.kiwiproperty.com",
|
---|
849 | "https://curriculumtool.education.govt.nz"
|
---|
850 | ]
|
---|
851 | }
|
---|
852 |
|
---|
853 | /* 4 */
|
---|
854 | {
|
---|
855 | "_id" : "ru",
|
---|
856 | "count" : 1.0,
|
---|
857 | "domain" : [
|
---|
858 | "http://www.treningmozga.com"
|
---|
859 | ]
|
---|
860 | }
|
---|
861 |
|
---|
862 | /* 5 */
|
---|
863 | {
|
---|
864 | "_id" : "de",
|
---|
865 | "count" : 1.0,
|
---|
866 | "domain" : [
|
---|
867 | "http://www.almancax.com" # Website to learn German, autotranslated
|
---|
868 | ]
|
---|
869 | }
|
---|
870 |
|
---|
871 |
|
---|
872 | But we're not catching a potentially large number of auto-translated sites, like
|
---|
873 | - https://www.gigalight.com/all-languages.html
|
---|
874 | - http://www.hzhinew.com/
|
---|
875 |
|
---|
876 | https://culturesconnection.com/manual-or-automatic-translation/
|
---|
877 | Manual Or Automatic Translation?
|
---|
878 |
|
---|
879 | Automatic translation continues to improve day by day. However, it is still unable to reach perfect levels of accuracy and lacks a natural feel. Will it ever replace human translation?
|
---|
880 |
|
---|
881 | --------------
|
---|
882 | Mr Bill Rogers' suggestions for beginnings of trying to sieve out the auto-translated sites:
|
---|
883 | - skip .com and .co.<tld> domains. But .co.nz is also used for non-commercial sites or sites that nevertheless have high-quality Maori language content.
|
---|
884 | - change cut-off value of OpenNLP language prediction? But for sentences and overlapping sentences we're not using the cut-off value; we just take the best predicted language regardless of confidence level.
|
---|
885 |
|
---|
886 | - Code for (a range of) loading of language options in auto-translated sites?
|
---|
887 |
|
---|
888 | ====================
|
---|
889 |
|
---|
890 | # https://stackoverflow.com/questions/20175122/how-can-i-use-not-like-operator-in-mongodb
|
---|
891 |
|
---|
892 | Info on the sites with Maori language content that are either from NZ or have .nz domain (TLD):
|
---|
893 |
|
---|
894 | db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}}, {$or:[{geoLocationCountryCode: "NZ"}, {domain: /.nz$/}]}]})
|
---|
895 |
|
---|
896 | db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}}, {$or:[{geoLocationCountryCode: "NZ"}, {domain: /.nz$/}]}]}).count()
|
---|
897 | 183
|
---|
898 |
|
---|
899 | Inverse: the sites detected as containing at least 1 Maori language sentence that are NEITHER from NZ NOR have a .nz domain ending (TLD):
|
---|
900 | db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}}, {geoLocationCountryCode: {$ne: "NZ"}}, {domain: {$not: /.nz$/}}]}).count()
|
---|
901 | 685
|
---|
902 |
|
---|
903 | The above two figures correctly add up to a total of 868 sites, which is the number of sites detected as containing at least 1 sentence in MRI:
|
---|
904 | db.getCollection('Websites').find({numPagesContainingMRI: {$gt: 0}}).count()
|
---|
905 | 868
|
---|
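Caveat on the /.nz$/ pattern used above: the dot is unescaped, so it matches any character, not just a literal ".". A plain JavaScript sketch comparing it with the escaped forms /\.nz$/ and /\.nz/ that appear in later queries in these notes. The first domain is from the listing above; the other two are hypothetical, made up to show the difference:

```javascript
const unescaped = /.nz$/;  // as in the queries above: "." matches any character
const escaped = /\.nz$/;   // literal dot, anchored at the end of the string
const unanchored = /\.nz/; // literal dot, but matches ".nz" anywhere in the string

const testDomains = [
  "https://admin.teara.govt.nz",         // from the listing above
  "http://www.example-benz",             // hypothetical: ends in "enz", no dot before "nz"
  "https://nl.admission.nz.example.com", // hypothetical: ".nz" in the middle
];

const regexResults = testDomains.map((domain) => ({
  domain,
  unescaped: unescaped.test(domain),
  escaped: escaped.test(domain),
  unanchored: unanchored.test(domain),
}));
console.table(regexResults);
```

The counts recorded above (183 and 685) were produced with the unescaped pattern and are left as they are. Whether any domain in the collection is actually affected by the difference has not been checked here.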
906 |
|
---|
907 | Without those with /mi in path:
|
---|
908 | db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}}, {geoLocationCountryCode: {$ne: "NZ"}}, {domain: {$not: /.nz$/}}, {urlContainsLangCodeInPath: false}]}).count()
|
---|
909 |
|
---|
910 | Now let's get a listing of all 685 sites to be manually inspected to determine whether they're auto-translated:
|
---|
911 |
|
---|
912 | /*
|
---|
913 | db.Websites.aggregate([
|
---|
914 | {
|
---|
915 | $match: {
|
---|
916 | $and: [{numPagesContainingMRI: {$gt: 0}}, {geoLocationCountryCode: {$ne: "NZ"}}, {domain: {$not: /.nz$/}}, {urlContainsLangCodeInPath: false}]
|
---|
917 | }
|
---|
918 | },
|
---|
919 | { $unwind: "$geoLocationCountryCode" },
|
---|
920 | {
|
---|
921 | $group: {
|
---|
922 | _id: {$toLower: '$geoLocationCountryCode'},
|
---|
923 | count: { $sum: 1 },
|
---|
924 | domain: { $addToSet: '$domain' }
|
---|
925 | }
|
---|
926 | },
|
---|
927 | { $sort : { count : -1} }
|
---|
928 | ]);
|
---|
929 | */
|
---|
930 | db.Websites.aggregate([
|
---|
931 | {
|
---|
932 | $match: {
|
---|
933 | $and: [{numPagesContainingMRI: {$gt: 0}}, {geoLocationCountryCode: {$ne: "NZ"}}, {domain: {$not: /.nz$/}}, {urlContainsLangCodeInPath: {$ne: true}}]
|
---|
934 | }
|
---|
935 | },
|
---|
936 | { $unwind: "$geoLocationCountryCode" },
|
---|
937 | {
|
---|
938 | $group: {
|
---|
939 | _id: {$toLower: '$geoLocationCountryCode'},
|
---|
940 | count: { $sum: 1 },
|
---|
941 | domain: { $addToSet: '$domain' }
|
---|
942 | }
|
---|
943 | },
|
---|
944 | { $sort : { count : -1} }
|
---|
945 | ]);
|
---|
946 |
|
---|
947 |
|
---|
948 | We can knock off another 54 non-NZ sites with our new urlContainsLangCodeInPathPrefix field:
|
---|
949 |
|
---|
950 | db.getCollection('Websites').find({urlContainsLangCodeInPathPrefix: true, geoLocationCountryCode: {$ne: "NZ"}, domain: {$not: /.nz$/}}).count()
|
---|
951 | 54
|
---|
952 |
|
---|
953 |
|
---|
954 | SO, can repeat query with new field "urlContainsLangCodeInPathPrefix":
|
---|
955 | Number of sites containing >= 1 MRI sentences that are not from NZ or of .nz TLD and which don't contain "/mi(/)" or "http(s)://mi." in URL path:
|
---|
956 | db.getCollection('Websites').find({$and: [
|
---|
957 | {numPagesContainingMRI: {$gt: 0}},
|
---|
958 | {geoLocationCountryCode: {$ne: "NZ"}},
|
---|
959 | {domain: {$not: /.nz$/}},
|
---|
960 | {urlContainsLangCodeInPathSuffix: {$ne: true}},
|
---|
961 | {urlContainsLangCodeInPathPrefix: {$ne: true}}
|
---|
962 | ]}).count()
|
---|
963 |
|
---|
964 | 651
|
---|
965 |
|
---|
966 |
|
---|
967 | REDO THE COUNT BY COUNTRY QUERY FOR THIS:
|
---|
968 |
|
---|
969 | db.Websites.aggregate([
|
---|
970 | {
|
---|
971 | $match: {
|
---|
972 | $and: [{numPagesContainingMRI: {$gt: 0}}, {geoLocationCountryCode: {$ne: "NZ"}}, {domain: {$not: /.nz$/}}, {urlContainsLangCodeInPathSuffix: {$ne: true}}, {urlContainsLangCodeInPathPrefix: {$ne: true}}]
|
---|
973 | }
|
---|
974 | },
|
---|
975 | { $unwind: "$geoLocationCountryCode" },
|
---|
976 | {
|
---|
977 | $group: {
|
---|
978 | _id: {$toLower: '$geoLocationCountryCode'},
|
---|
979 | count: { $sum: 1 },
|
---|
980 | domain: { $addToSet: '$domain' }
|
---|
981 | }
|
---|
982 | },
|
---|
983 | { $sort : { count : -1} }
|
---|
984 | ]);
|
---|
985 |
|
---|
986 |
|
---|
987 | AFTER BUGFIX, WITH miInURLPath NOW BEING SET CORRECTLY:
|
---|
988 | db.getCollection('Websites').find(
|
---|
989 | {$and: [
|
---|
990 | {numPagesContainingMRI: {$gt: 0}},
|
---|
991 | {geoLocationCountryCode: {$ne: "NZ"}},
|
---|
992 | {domain: {$not: /.nz$/}},
|
---|
993 | {urlContainsLangCodeInPath: {$ne: true}}
|
---|
994 | ]}).count()
|
---|
995 |
|
---|
996 | 220
|
---|
997 |
|
---|
998 | db.Websites.aggregate([
|
---|
999 | {
|
---|
1000 | $match: {
|
---|
1001 | $and: [
|
---|
1002 | {numPagesContainingMRI: {$gt: 0}},
|
---|
1003 | {geoLocationCountryCode: {$ne: "NZ"}},
|
---|
1004 | {domain: {$not: /.nz$/}},
|
---|
1005 | {urlContainsLangCodeInPath: {$ne: true}}
|
---|
1006 | ]
|
---|
1007 | }
|
---|
1008 | },
|
---|
1009 | { $unwind: "$geoLocationCountryCode" },
|
---|
1010 | {
|
---|
1011 | $group: {
|
---|
1012 | _id: {$toLower: '$geoLocationCountryCode'},
|
---|
1013 | count: { $sum: 1 },
|
---|
1014 | domain: { $addToSet: '$domain' }
|
---|
1015 | }
|
---|
1016 | },
|
---|
1017 | { $sort : { count : -1} }
|
---|
1018 | ]);
|
---|
1019 |
|
---|
1020 | A website's pages can be inspected to judge whether the site is relevant or auto-translated, as follows:
|
---|
1021 | db.getCollection('Webpages').find({URL:/svenkirsten.com/, mriSentenceCount: {$gt: 0}})
|
---|
1022 |
|
---|
1023 |
|
---|
1024 | * CN: Only 1/113 sites from CN stood out as being of interest: http://kiwi2china.com/
|
---|
1025 | BUT: it's auto-translated (e.g. Dutch is clearly auto-translated), MRI not in default or any visible drop down list, and the domain changes once you view it in Dutch to https://nl.admission.nz/
|
---|
1026 |
|
---|
1027 | * FR: 35 sites from FR
|
---|
1028 | http://blueheavenisland.com - French Polynesia
|
---|
1029 | https://www.lexilogos.com/ -> takes me to NZ website MaoriDictionary.co.nz etc for translating words anyway
|
---|
1030 | http://kihikihi.fr/ -> travel (blog?). Appears to be Hawaiian related and not Maori.
|
---|
1031 | !! http://chantsdeluttes.free.fr/versionsinter/page%20maori.html -> Seems it may be a proper translation or composition, as Dutch and Flemish (and Groningense) versions are different songs by individual translators/composers
|
---|
1032 | http://splaf.free.fr/pfurb.html - Tahiti, French Polynesian, ... island names
|
---|
1033 | http://mi.fitnessrebates.com - Uses https://wordpress.org/plugins/weglot/ wordpress-compatible multilingual plugin, which ensures translated pages get indexed by google - exactly what we want to avoid
|
---|
1034 | *
|
---|
1035 |
|
---|
1036 |
|
---|
1037 | DE:
|
---|
1038 | http://etymologie.info/~e/n_/nz-___reg.html - placenames, not meaningful
|
---|
1039 | !! https://www.cartogiraffe.com/ - some genuine pages (Rarotongan), but one page is in Czech and had a single word misidentified as MRI
|
---|
1040 | ~ http://svenkirsten.com/ - one page mentions "tiki" but the rest is in English. The other is an (English) caption of "Book of Tiki A Maori Maiden"
|
---|
1041 | - herocity - autotranslated
|
---|
1042 | - weltderberge.de - 3 pages mention NZ mountains by name.
|
---|
1043 | ~ (arts.mythologica.fr) https://mythologica.fr/oceanie/texte/pantheon_polynesien.pdf - mentions certain Maori Gods and other Polynesian Gods by name.
|
---|
1044 | - https://traynews.com - nothing in MRI, misdetected
|
---|
1045 | ~ http://klaaskoehne.de/galleries/nzl-taranaki/index.html - mentions NZ mountain names
|
---|
1046 | - http://www.nierstrasz.org/deGrauwRegister.rtf - misdetected European (Dutch) names as MRI
|
---|
1047 | - https://afrikhepri.org/mi/ - autotranslated
|
---|
1048 | - https://www.tvteile.de - pure German pages, misdetected "Automatik" as a Maori language word
|
---|
1049 | - etoile-de-lune.net - 5 pages containing 1 sentence each but none with 2 sentences detected
|
---|
1050 |
|
---|
1051 | - ITALY:
|
---|
1052 | http://oipaz.net/IMG/GalleriaAotearoa/ - NZ photogallery with each photo captioned by placename
|
---|
1053 | http://www.marcosanti.it/Reportage/Oceania_ph/Nuova_Zelanda/ - each photo captioned by NZ placename
|
---|
1054 | http://www.pegasoesmicamion.com/ - REO abbreviation misidentified, also in REO%20PUBLICIDAD.htm
|
---|
1055 | - AUSTRIA:
|
---|
1056 | petit-prince.at - Tahitian and Wayuu (Venezuela) translations of Le Petit Prince
|
---|
1057 | http://www.tmtmm.net/newzealand - photos from NZ named after places and people's names
|
---|
1058 | - ROMANIA: parohiauceadesus.ro - Sentences of single Romanian words misidentified.
|
---|
1059 | - ISRAEL:
|
---|
1060 | http://www.daat.ac.il - misidentification of "no." as MRI, and Hebrew words.
|
---|
1061 | https://www.hitiaotera.com/ - misidentification of Tahitian pages
|
---|
1062 | - RUSSIA: https://www.gismeteo.lv - misidentification of an email address
|
---|
1063 | - JAPAN: http://yutaka.it-n.jp - many pages of scientific names of (plants?) which are often misdetected as MRI
|
---|
1064 | !! Ireland, ie: https://coggle.it
|
---|
1065 | - IRAN: https://www.dideo.ir/v/yt/d6cgya0ze-E - video title from MaoriTelevision website
|
---|
1066 | ? - CZECH republic: https://www.fipojobs.com/new-zealand/jobs-work-p-1 - NZ job position title in MRI but rest in English
|
---|
1067 | - SPAIN: http://www.info-hoteles.com/nz/2/hotels_lake_rotoiti.asp - 2 uses of the same placename
|
---|
1068 | - SINGAPORE: https://omg-solutions.com - autotranslated
|
---|
1069 | - TURKEY: https://www.elitedeluxe.com.tr/mi/yatak-odasi-takimlari - autotranslated
|
---|
1070 | - MEXICO: http://www.gelbukh.com - misidentification, lines of just numbers or phrases like "Area Chair" in English and Russian CVs.
|
---|
1071 | - FINLAND: http://pertti.com - travelogue, placenames
|
---|
1072 | - SWITZERLAND CH:
|
---|
1073 | nicoledidi.ch - blog, placenames
|
---|
1074 | https://photos.axelebert.org - Tahiti related content
|
---|
1075 | - UNKNOWN: https://www.viveipcl.com: tours website, placenames mentioned
|
---|
1076 | #- EU: https://www.the-good-stuff-factory.be/mi/ : Autotranslated
|
---|
1077 | !! - BULGARIA: http://anitra.net/activism/humanrights/UDHR/rrt_print.htm (2 pages)
|
---|
1078 |
|
---|
1079 |
|
---|
1080 | TREATING AUSTRALIA AND GREAT BRITAIN MORE SPECIALLY (don't ignore /mi in URL, same as with NZ, but do leave out .nz TLDs since we cover them under NZ - TODO: later find country codes of all .nz TLDs):
|
---|
1081 | [nothing found under "UK", only under "GB"]
|
---|
1082 |
|
---|
1083 | db.getCollection('Websites').find({
|
---|
1084 | domain: {$not: /.nz$/},
|
---|
1085 | numPagesContainingMRI: {$gt: 0},
|
---|
1086 | $or: [{geoLocationCountryCode: "AU"}, {geoLocationCountryCode: "GB"}]
|
---|
1087 | }).count()
|
---|
1088 | 11
|
---|
1089 |
|
---|
1090 | db.Websites.aggregate([
|
---|
1091 | {
|
---|
1092 | $match: {
|
---|
1093 | domain: {$not: /.nz$/},
|
---|
1094 | numPagesContainingMRI: {$gt: 0},
|
---|
1095 | $or: [{geoLocationCountryCode: "AU"}, {geoLocationCountryCode: "GB"}]
|
---|
1096 | }
|
---|
1097 | },
|
---|
1098 | { $unwind: "$geoLocationCountryCode" },
|
---|
1099 | {
|
---|
1100 | $group: {
|
---|
1101 | _id: {$toLower: '$geoLocationCountryCode'},
|
---|
1102 | count: { $sum: 1 },
|
---|
1103 | domain: { $addToSet: '$domain' }
|
---|
1104 | }
|
---|
1105 | },
|
---|
1106 | { $sort : { count : -1} }
|
---|
1107 | ]);
|
---|
1108 |
|
---|
1109 | AUSTRALIA:
|
---|
1110 | !! https://www.kiwiproperty.com - e.g. https://www.kiwiproperty.com/the-base/mi/he-paepaki/ has some actual MRI sentences. [Not autotranslated]
|
---|
1111 | ? http://fionajack.net - Wellington gallery of artist. A few occurrences of Kia Ora in a title like context (e.g. "Street Party Kia Ora! Kia Ora!")
|
---|
1112 | !! https://infogram.com/te-marautanga-o-aotearoa-moe-pld-allocations-2012-1go502ygvn562jd - site of individual pages (like docs.google.com). This one has a relevant infogram image.
|
---|
1113 | !! https://koreromaori.com - some actual Maori language sentences
|
---|
1114 | http://theunderwaterworld.com/Galleries/Roimata/roimata-frame.html - placenames
|
---|
1115 |
|
---|
1116 | UK:
|
---|
1117 | http://www.wordsearchfun.com/200628_Word_Find_wordsearch.html - 2 word games with Maori words (one of them has 3 different views, e.g. print view)
|
---|
1118 | ? https://omniatlas.com/maps/australasia/18400206/plain/ - historical map with Maori iwi names over NZ map regions
|
---|
1119 | ? https://omniatlas.com/maps/australasia/18400206/ - historical map of Australia and NZ at the time of the Treaty of Waitangi, with events marked in English
|
---|
1120 | https://centrallanguageschool.com - AUTOTRANSLATED
|
---|
1121 | https://www.solasolv.com - Autotranslated product site
|
---|
1122 | http://mikestephens.co.uk/ - photo captions containing NZ placenames
|
---|
1123 | http://www.woolrych.org/nzholiday2004/ - photogallery captioned with NZ placenames
|
---|
1124 | --------------
|
---|
1125 |
|
---|
1126 | GETTING TABLE DATA OUT OF MONGO DB:
|
---|
1127 |
|
---|
1128 | https://stackoverflow.com/questions/28733692/how-to-export-json-from-mongodb-using-robomongo
|
---|
1129 | "export to file" as in a spreadsheet like to a .csv?
|
---|
1130 |
|
---|
1131 | IMO this is the EASIEST way to do this in Robo 3T (formerly robomongo):
|
---|
1132 |
|
---|
1133 | 1. In the top right of the Robo 3T GUI there is a "View Results in text mode" button, click it and copy everything
|
---|
1134 |
|
---|
1135 | 2. paste everything into this website: https://json-csv.com/
|
---|
1136 |
|
---|
1137 | 3. click the download button and now you have it in a spreadsheet.
|
---|
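As an alternative to pasting results into a website, the aggregate output can be flattened to CSV directly. A minimal sketch in plain JavaScript, assuming result documents shaped like the { _id, count, domain: [...] } groups produced by the $addToSet aggregates in these notes. The function name toCsv and the column headings are made up for illustration; the two sample groups are copied from the listing earlier in these notes:

```javascript
// Flattens grouped results into one CSV row per (country code, domain) pair.
function toCsv(groups) {
  // Quote every value and double any embedded quotes, per usual CSV convention.
  const quote = (value) => `"${String(value).replace(/"/g, '""')}"`;
  const lines = ["countryCode,count,domain"];
  for (const { _id, count, domain = [] } of groups) {
    for (const d of domain) {
      lines.push([_id, count, d].map(quote).join(","));
    }
  }
  return lines.join("\n");
}

// Two groups copied from the listing earlier in these notes.
const exampleGroups = [
  { _id: "ru", count: 1, domain: ["http://www.treningmozga.com"] },
  { _id: "de", count: 1, domain: ["http://www.almancax.com"] },
];

const csvText = toCsv(exampleGroups);
console.log(csvText);
```

In the shell the same function could be fed the array obtained from an aggregate's .toArray(), and the printed text saved to a .csv file.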
1138 |
|
---|
1139 |
|
---|
1141 |
|
---|
1142 |
|
---|
1143 | ---------------------
|
---|
1144 |
|
---|
1145 | Count of websites that have at least 1 page containing at least one sentence detected as MRI
|
---|
1146 | AND that have mi in the URL path:
|
---|
1147 |
|
---|
1148 | db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}]}).count()
|
---|
1149 |
|
---|
1150 | 491
|
---|
1151 |
|
---|
1152 |
|
---|
1153 |
|
---|
1154 | # The websites that have some MRI detected AND which are either in NZ, or have a .nz TLD,
|
---|
1155 | # or (if they're from overseas) don't contain /mi or mi.* in URL path:
|
---|
1156 |
|
---|
1157 | db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{$or: [{geoLocationCountryCode: "NZ"}, {domain: /\.nz$/}, {urlContainsLangCodeInPath: false}]}]}).count()
|
---|
1158 | 396
|
---|
1159 |
|
---|
1160 | Include Australia (to get the valid "kiwiproperty.com" website included in the result list):
|
---|
1161 |
|
---|
1162 | db.getCollection('Websites').find({$and: [
|
---|
1163 | {numPagesContainingMRI: {$gt: 0}},
|
---|
1164 | {$or: [{geoLocationCountryCode: /(NZ|AU)/}, {domain: /\.nz$/}, {urlContainsLangCodeInPath: false}]}
|
---|
1165 | ]}).count()
|
---|
1166 |
|
---|
1167 | 397
|
---|
1168 |
|
---|
1169 | # aggregate results by a count of country codes
|
---|
1170 | db.Websites.aggregate([
|
---|
1171 | {
|
---|
1172 | $match: {
|
---|
1173 | $and: [
|
---|
1174 | {numPagesContainingMRI: {$gt: 0}},
|
---|
1175 | {$or: [{geoLocationCountryCode: /(NZ|AU)/}, {domain: /\.nz$/}, {urlContainsLangCodeInPath: false}]}
|
---|
1176 | ]
|
---|
1177 | }
|
---|
1178 | },
|
---|
1179 | { $unwind: "$geoLocationCountryCode" },
|
---|
1180 | {
|
---|
1181 | $group: {
|
---|
1182 | _id: {$toLower: '$geoLocationCountryCode'},
|
---|
1183 | count: { $sum: 1 }
|
---|
1184 | }
|
---|
1185 | },
|
---|
1186 | { $sort : { count : -1} }
|
---|
1187 | ]);
|
---|
1188 |
|
---|
1189 |
|
---|
1190 | # Just considering those sites outside NZ and without a .nz TLD:
|
---|
1191 |
|
---|
1192 | db.getCollection('Websites').find({$and: [
|
---|
1193 | {geoLocationCountryCode: {$ne: "NZ"}},
|
---|
1194 | {domain: {$not: /\.nz/}},
|
---|
1195 | {numPagesContainingMRI: {$gt: 0}},
|
---|
1196 | {$or: [{geoLocationCountryCode: "AU"}, {urlContainsLangCodeInPath: false}]}
|
---|
1197 | ]}).count()
|
---|
1198 |
|
---|
1199 | 221 websites
|
---|
1200 |
|
---|
1201 | # counts by country code excluding NZ related sites
|
---|
1202 | db.Websites.aggregate([
|
---|
1203 | {
|
---|
1204 | $match: {
|
---|
1205 | $and: [
|
---|
1206 | {geoLocationCountryCode: {$ne: "NZ"}},
|
---|
1207 | {domain: {$not: /\.nz/}},
|
---|
1208 | {numPagesContainingMRI: {$gt: 0}},
|
---|
1209 | {$or: [{geoLocationCountryCode: "AU"}, {urlContainsLangCodeInPath: false}]}
|
---|
1210 | ]
|
---|
1211 | }
|
---|
1212 | },
|
---|
1213 | { $unwind: "$geoLocationCountryCode" },
|
---|
1214 | {
|
---|
1215 | $group: {
|
---|
1216 | _id: {$toLower: '$geoLocationCountryCode'},
|
---|
1217 | count: { $sum: 1 },
|
---|
1218 | domain: { $addToSet: '$domain' }
|
---|
1219 | }
|
---|
1220 | },
|
---|
1221 | { $sort : { count : -1} }
|
---|
1222 | ]);
|
---|
1223 |
|
---|
1224 |
|
---|
1225 | # But to produce the tentative non-product sites, we also want the aggregate for all NZ sites (from NZ or with .nz tld):
|
---|
1226 | db.getCollection('Websites').find({$and: [
|
---|
1227 | {numPagesContainingMRI: {$gt: 0}},
|
---|
1228 | {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]}
|
---|
1229 | ]}).count()
|
---|
1230 |
|
---|
1231 | 176
|
---|
1232 |
|
---|
1233 | (Total is 221+176 = 397, which adds up).
|
---|
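Why the two counts can simply be added: the NZ query and the non-NZ query split the sites using a condition and its negation (De Morgan: not (NZ or .nz) is the same as not NZ and not .nz), so no site falls into both and none falls between them. A plain JavaScript sketch of that partition over a hypothetical in-memory sample. Only the field names come from these notes, and the extra AU / urlContainsLangCodeInPath condition of the non-NZ query is left out to keep the illustration small:

```javascript
// Hypothetical sample; only the field names come from these notes.
const partitionSample = [
  { domain: "https://a.example.nz", geoLocationCountryCode: "NZ" },
  { domain: "https://b.example.nz", geoLocationCountryCode: "US" },
  { domain: "https://c.example.com", geoLocationCountryCode: "NZ" },
  { domain: "https://d.example.com", geoLocationCountryCode: "AU" },
  { domain: "https://e.example.com", geoLocationCountryCode: "DE" },
];

// {$or: [{geoLocationCountryCode: "NZ"}, {domain: /\.nz/}]}
const isNzRelated = (s) => s.geoLocationCountryCode === "NZ" || /\.nz/.test(s.domain);
// {geoLocationCountryCode: {$ne: "NZ"}} AND {domain: {$not: /\.nz/}}
const isNotNzRelated = (s) => s.geoLocationCountryCode !== "NZ" && !/\.nz/.test(s.domain);

const nzRelated = partitionSample.filter(isNzRelated);
const notNzRelated = partitionSample.filter(isNotNzRelated);
const inBoth = nzRelated.filter(isNotNzRelated);

console.log(nzRelated.length, notNzRelated.length, inBoth.length);
```

If the two counts ever stop adding up to the combined query's total, the first thing to check is whether the two halves still use exactly the same regex and the same country-code test.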
1234 |
|
---|
1235 | # Get the count (and domain listing) output grouped under a hardcoded _id of "nz":
|
---|
1236 | db.Websites.aggregate([
|
---|
1237 | {
|
---|
1238 | $match: {
|
---|
1239 | $and: [
|
---|
1240 | {numPagesContainingMRI: {$gt: 0}},
|
---|
1241 | {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]}
|
---|
1242 | ]
|
---|
1243 | }
|
---|
1244 | },
|
---|
1245 | { $unwind: "$geoLocationCountryCode" },
|
---|
1246 | {
|
---|
1247 | $group: {
|
---|
1248 | _id: "nz",
|
---|
1249 | count: { $sum: 1 },
|
---|
1250 | domain: { $addToSet: '$domain' }
|
---|
1251 | }
|
---|
1252 | },
|
---|
1253 | { $sort : { count : -1} }
|
---|
1254 | ]);
|
---|