root/other-projects/maori-lang-detection/MoreReading/mongodb.txt @ 33666

Revision 33666, 17.6 KB (checked in by ak19, 2 months ago)

Having finished sending all the crawl data to MongoDB:
1. Recrawled the 2 sites which I had earlier noted required recrawling: 00152 and 00332. 00152 required changes to how it was crawled. MP3 files needed to be blocked, as there were HBase error messages about key values being too large.
2. Modified the regex-urlfilter.GS_TEMPLATE file to block mp3 files in general for future crawls too (in the location of the file where jpg etc. were already blocked by nutch's default regex url filters).
3. Further had to restrict the 00152 site to be crawled only under its /maori/ sub-domain. Since the seedURL maori.html was not under a /maori/ url, this revealed that the CCWETProcessor code didn't yet allow the filters to accept seedURLs in cases where the crawl was restricted to a subdomain (as expressed in the conf/sites-too-big-to-exhaustively-crawl file) but the seedURL didn't match the restricting regex filters. So now, in such cases, the CCWETProcessor also adds the seedURLs that don't match to the filters (so we get just the single page of each such seedURL) besides a filter on the requested subdomain, so that we follow all pages linked from the seedURLs that match the subdomain expression.
4. Adding to_crawl.tar.gz to svn. This is the tarball of the to_crawl sites that I actually ran nutch over: all the sites folders with their seedURL.txt and regex-urlfilter.txt files that batchcrawl.sh runs over. It doesn't use the latest version of the sites folder and blacklist/whitelist files generated by CCWETProcessor, since the latest version was regenerated after the final modifications to CCWETProcessor, which were made after crawling had finished. But to_crawl.tar.gz does have a manually modified 00152, with the correct regex-urlfilter file, and uses the newer regex-urlfilter.GS_TEMPLATE file that blocks mp3 files.
5. crawledNode6.tar.gz now contains the dump output for sites 00152 and 00332, which were crawled on node6 today (after which their processed dump.txt file results were added into MongoDB).
6. MoreReading/mongodb.txt now contains the results of some queries I ran against the total nutch-crawled data.

MongoDB
Installation:
    https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/
        https://docs.mongodb.com/manual/administration/install-on-linux/
    https://hevodata.com/blog/install-mongodb-on-ubuntu/
    https://www.digitalocean.com/community/tutorials/how-to-install-mongodb-on-ubuntu-16-04
    CENTOS (Analytics): https://tecadmin.net/install-mongodb-on-centos/
    FROM SOURCE: https://github.com/mongodb/mongo/wiki/Build-Mongodb-From-Source
GUI:
    https://robomongo.org/
    Robomongo is Robo 3T now

https://www.tutorialspoint.com/mongodb/mongodb_java.htm
JAR FILE:
    http://central.maven.org/maven2/org/mongodb/mongo-java-driver/
    https://mongodb.github.io/mongo-java-driver/



https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/
http://www.programmersought.com/article/6500308940/

   52  sudo apt-get install mongodb-clients
   53  mongo 'mongodb://mongodb.cms.waikato.ac.nz:27017' -u anupama -p

Failed with
    Error: HostAndPort: host is empty at src/mongo/shell/mongo.js:148
    exception: connect failed

This is due to a version incompatibility between the mongo client and the MongoDB server.
The solution is to follow the instructions at http://www.programmersought.com/article/6500308940/
and then https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/
as below:

   54  sudo apt-get purge mongodb-clients
   55  sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 9DA31620334BD75D9DCB49F368818C72E52529D4
   56  echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu xenial/mongodb-org/4.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.0.list
   57  sudo apt-get update
   58  sudo apt-get install mongodb-clients
   59  mongo 'mongodb://mongodb.cms.waikato.ac.nz:27017' -u anupama -p
(still doesn't work)
   60  sudo apt-get install -y mongodb-org
The above ensures an up-to-date mongo client but installs the MongoDB server too. Maybe this is the only step needed to install both an up-to-date mongo client and the MongoDB server?
   72  sudo service mongod status

  103  sudo service mongod start
"mongod" stands for mongo-daemon. This runs the MongoDB server, which listens for client connections.
  104  sudo service mongod status
   88  sudo service mongod stop


DETAILS:

wharariki:[879]/Scratch/ak19/gs3-extensions/maori-lang-detection>mongo 'mongodb://mongodb.cms.waikato.ac.nz:27017' -u anupama -p

This didn't work with the password. It failed with:

    MongoDB shell version: 2.6.10
    Enter password:
    connecting to: mongodb://mongodb.cms.waikato.ac.nz:27017
    2019-11-04T20:02:47.970+1300 Assertion: 13110:HostAndPort: host is empty
    2019-11-04T20:02:47.970+1300 0x6b75c9 0x659e9f 0x636f69 0x4fa55c 0x501249 0x4fa7f1 0x6006fd 0x5eb869 0x7f7bfbd47d76 0x1f3c10d06362
     mongo(_ZN5mongo15printStackTraceERSo+0x39) [0x6b75c9]
     mongo(_ZN5mongo10logContextEPKc+0x21f) [0x659e9f]
     mongo(_ZN5mongo11msgassertedEiPKc+0xd9) [0x636f69]
     mongo(_ZN5mongo16ConnectionString12_fillServersENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x50c) [0x4fa55c]
     mongo(_ZN5mongo16ConnectionStringC1ENS0_14ConnectionTypeERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES9_+0x99) [0x501249]
     mongo(_ZN5mongo16ConnectionString5parseERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERS6_+0x201) [0x4fa7f1]
     mongo(_ZN5mongo17mongoConsExternalEPNS_7V8ScopeERKN2v89ArgumentsE+0x11d) [0x6006fd]
     mongo(_ZN5mongo7V8Scope10v8CallbackERKN2v89ArgumentsE+0xa9) [0x5eb869]
     /usr/lib/libv8.so.3.14.5(+0x99d76) [0x7f7bfbd47d76]
     [0x1f3c10d06362]
    2019-11-04T20:02:47.971+1300 Error: HostAndPort: host is empty at src/mongo/shell/mongo.js:148
    exception: connect failed


This is due to a version incompatibility between the mongo client and the MongoDB server.
The client version can be found in the output above (2.6.10).
The server version can be found from within the mongo client shell. Starting the shell without loading a db:


    wharariki:[880]/Scratch/ak19/gs3-extensions/maori-lang-detection>mongo --shell -nodb
    MongoDB shell version: 2.6.10 <<<<<<<<<-------------------<<<< MONGO CLIENT VERSION
    type "help" for help
    > help
        db.help()                    help on db methods
        db.mycoll.help()             help on collection methods
        sh.help()                    sharding helpers
        rs.help()                    replica set helpers
        help admin                   administrative help
        help connect                 connecting to a db help
        help keys                    key shortcuts
        help misc                    misc things to know
        help mr                      mapreduce

        show dbs                     show database names
        show collections             show collections in current database
        show users                   show users in current database
        show profile                 show most recent system.profile entries with time >= 1ms
        show logs                    show the accessible logger names
        show log [name]              prints out the last segment of log in memory, 'global' is default
        use <db_name>                set current database
        db.foo.find()                list objects in collection foo
        db.foo.find( { a : 1 } )     list objects in foo where a == 1
        it                           result of the last line evaluated; use to further iterate
        DBQuery.shellBatchSize = x   set default number of items to display on shell
        exit                         quit the mongo shell

    > help connect

    Normally one specifies the server on the mongo shell command line.  Run mongo --help to see those options.
    Additional connections may be opened:

        var x = new Mongo('host[:port]');
        var mydb = x.getDB('mydb');
      or
        var mydb = connect('host[:port]/mydb');

    Note: the REPL prompt only auto-reports getLastError() for the shell command line connection.

    Trying to get help on connect options (the shell treats mydb.connect as a collection named "connect", so this prints the generic DBCollection help):

    > var x = new Mongo('mongodb.cms.waikato.ac.nz:27017');
    > var mydb = x.getDB('anupama');

    > mydb.connect.help()
    DBCollection help
        db.connect.find().help() - show DBCursor help
        db.connect.count()
        db.connect.copyTo(newColl) - duplicates collection by copying all documents to newColl; no indexes are copied.
        db.connect.convertToCapped(maxBytes) - calls {convertToCapped:'connect', size:maxBytes}} command
        db.connect.dataSize()
        db.connect.distinct( key ) - e.g. db.connect.distinct( 'x' )
        db.connect.drop() drop the collection
        db.connect.dropIndex(index) - e.g. db.connect.dropIndex( "indexName" ) or db.connect.dropIndex( { "indexKey" : 1 } )
        db.connect.dropIndexes()
        db.connect.ensureIndex(keypattern[,options]) - options is an object with these possible fields: name, unique, dropDups
        db.connect.reIndex()
        db.connect.find([query],[fields]) - query is an optional query filter. fields is optional set of fields to return.
                                                  e.g. db.connect.find( {x:77} , {name:1, x:1} )
        db.connect.find(...).count()
        db.connect.find(...).limit(n)
        db.connect.find(...).skip(n)
        db.connect.find(...).sort(...)
        db.connect.findOne([query])
        db.connect.findAndModify( { update : ... , remove : bool [, query: {}, sort: {}, 'new': false] } )
        db.connect.getDB() get DB object associated with collection
        db.connect.getPlanCache() get query plan cache associated with collection
        db.connect.getIndexes()
        db.connect.group( { key : ..., initial: ..., reduce : ...[, cond: ...] } )
        db.connect.insert(obj)
        db.connect.mapReduce( mapFunction , reduceFunction , <optional params> )
        db.connect.aggregate( [pipeline], <optional params> ) - performs an aggregation on a collection; returns a cursor
        db.connect.remove(query)
        db.connect.renameCollection( newName , <dropTarget> ) renames the collection.
        db.connect.runCommand( name , <options> ) runs a db command with the given name where the first param is the collection name
        db.connect.save(obj)
        db.connect.stats()
        db.connect.storageSize() - includes free space allocated to this collection
        db.connect.totalIndexSize() - size in bytes of all the indexes
        db.connect.totalSize() - storage allocated for all data and indexes
        db.connect.update(query, object[, upsert_bool, multi_bool]) - instead of two flags, you can pass an object with fields: upsert, multi
        db.connect.validate( <full> ) - SLOW
        db.connect.getShardVersion() - only for use with sharding
        db.connect.getShardDistribution() - prints statistics about data distribution in the cluster
        db.connect.getSplitKeysForChunks( <maxChunkSize> ) - calculates split points over all chunks and returns splitter function
        db.connect.getWriteConcern() - returns the write concern used for any operations on this collection, inherited from server/db if set
        db.connect.setWriteConcern( <write concern doc> ) - sets the write concern for writes to the collection
        db.connect.unsetWriteConcern( <write concern doc> ) - unsets the write concern for writes to the collection
    > mydb.version()
    4.0.13 <<<<<<<<<-------------------<<<< MONGODB SERVER VERSION

(Check Mongo server version: https://stackoverflow.com/questions/38160412/how-to-find-the-exact-version-of-installed-mongodb)

Finally we now know the MongoDB server version: 4.0.13
This version doesn't work with our mongo client (shell) version of 2.6.10.
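
The client/server comparison above can also be done mechanically. What follows is only a minimal sketch in plain JavaScript (the mongo shell's language), needing no MongoDB connection. The helper names parseVersion and clientLooksTooOld are hypothetical, and treating a lower major version as "too old" is a simplifying assumption of this sketch, not MongoDB's official compatibility rule.

```javascript
// Hypothetical helper: split a version string such as "4.0.13" into numbers.
function parseVersion(versionString) {
  return versionString.trim().split(".").map(Number);
}

// Assumption (not an official MongoDB rule): flag the shell as too old
// whenever its major version is lower than the server's major version.
function clientLooksTooOld(clientVersion, serverVersion) {
  const [clientMajor] = parseVersion(clientVersion);
  const [serverMajor] = parseVersion(serverVersion);
  return clientMajor < serverMajor;
}

// Versions taken from the shell output above.
console.log(clientLooksTooOld("2.6.10", "4.0.13")); // true
console.log(clientLooksTooOld("4.2.1", "4.2.1"));   // false
```

With the versions recorded in these notes, the 2.6.10 shell is flagged against the 4.0.13 server, while the matching 4.2.1 client and server on the analytics node are not.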


DETAILS OF INSTALLING MONGO-CLIENT AND UPDATING IT, AND INSTALLING MONGODB SERVER:


   54  sudo apt-get purge mongodb-clients
   55  sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 9DA31620334BD75D9DCB49F368818C72E52529D4
   56  echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu xenial/mongodb-org/4.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.0.list
   57  sudo apt-get update
   58  sudo apt-get install mongodb-clients
   59  mongo 'mongodb://mongodb.cms.waikato.ac.nz:27017' -u anupama -p
   60  sudo apt-get install -y mongodb-org
   61  mongo 'mongodb://mongodb.cms.waikato.ac.nz:27017' -u anupama -p
   62  sudo service apache2 status
   63  sudo service sshd status
   64  sudo service mongodb status
   65  sudo service mongo status
   66  mongod
   67  mongod --help
   68  mongod --help | less
   69  mongod -f /etc/mongod.conf
   70  sudo mongod -f /etc/mongod.conf
   71  less /etc/mongod.conf
   72  sudo service mongod status
   73  sudo service mongod start
   74  sudo service mongod status
   75  ls -l  /var/log/mongodb/mongod.log
   76  sudo rm  /var/log/mongodb/mongod.log
   77  sudo service mongod status
   78  sudo service mongod start
   79  sudo service mongod status
   80  sudo service mongod stop
   81  ps auxww | grep mongo
   82  sudo service mongod start
   83  sudo service mongod status
   84  ps auxww | grep mongo
   85  sudo dmsg
   86  sudo dmesg
   87  sudo service mongod status
   88  sudo service mongod stop
   89  sudo service mongod start
   90  sudo dmesg
   91  sudo less  /var/log/mongodb/mongod.log
   92  ls /var/lib/
   93  ls -ld /var/lib/
   94  ls -l  /var/log/mongodb/mongod.log
   95  ls -ld /var/lib/
   96  groups mongodb
   97  less /etc/mongod.conf
   98  sudo less  /var/log/mongodb/mongod.log
   99  less /etc/mongod.conf
  100  ls -l /var/lib/mongodb/
  101  sudo chown -R mongodb /var/lib/mongodb/
  102  sudo chgrp -R mongodb /var/lib/mongodb/
  103  sudo service mongod start
  104  sudo service mongod status
  105  history



MONGO DB ROBO 3T
1. Download "Double Pack" from https://robomongo.org/
2. Untar its contents. Then untar the tarball inside it.
3. Run:
    wharariki:[110]~/Downloads/robo3t-1.3.1-linux-x86_64-7419c406>./bin/robo3t

===================
On analytics, vagrant node1, we've installed the mongodb server and client.
We're able to successfully create collections there.


vagrant@node1:~$ mongo
MongoDB shell version v4.2.1
connecting to: mongodb://127.0.0.1:27017/?compressors=disabled&gssapiServiceName=mongodb
Implicit session: session { "id" : UUID("87bb585c-4685-47f6-bf89-a93801daeb2d") }
MongoDB server version: 4.2.1
Server has startup warnings:
2019-11-04T07:48:14.197+0000 I  STORAGE  [initandlisten]
2019-11-04T07:48:14.198+0000 I  STORAGE  [initandlisten] ** WARNING: Using the XFS filesystem is strongly recommended with the WiredTiger storage engine
2019-11-04T07:48:14.198+0000 I  STORAGE  [initandlisten] **          See http://dochub.mongodb.org/core/prodnotes-filesystem
2019-11-04T07:48:14.624+0000 I  CONTROL  [initandlisten]
2019-11-04T07:48:14.624+0000 I  CONTROL  [initandlisten] ** WARNING: Access control is not enabled for the database.
2019-11-04T07:48:14.624+0000 I  CONTROL  [initandlisten] **          Read and write access to data and configuration is unrestricted.
2019-11-04T07:48:14.624+0000 I  CONTROL  [initandlisten]
---
Enable MongoDB's free cloud-based monitoring service, which will then receive and display
metrics about your deployment (disk utilization, CPU, operation statistics, etc).

The monitoring data will be available on a MongoDB website with a unique URL accessible to you
and anyone you share the URL with. MongoDB may use this information to make product
improvements and to suggest MongoDB products and deployment options to you.

To enable free monitoring, run the following command: db.enableFreeMonitoring()
To permanently disable this reminder, run the following command: db.disableFreeMonitoring()
---

> show dbs
admin   0.000GB
config  0.000GB
local   0.000GB
> use db ateacrawldata
2019-11-05T05:24:20.155+0000 E  QUERY    [js] Error: [db ateacrawldata] is not a valid database name :
Mongo.prototype.getDB@src/mongo/shell/mongo.js:51:12
getDatabase@src/mongo/shell/session.js:913:28
DB.prototype.getSiblingDB@src/mongo/shell/db.js:22:12
shellHelper.use@src/mongo/shell/utils.js:803:10
shellHelper@src/mongo/shell/utils.js:790:15
@(shellhelp2):1:1
> db.createCollection('webpages');
{ "ok" : 1 }
> db.webpages.drop();
... ^C

> db.webpages.drop();
true
> use ateacrawldata
switched to db ateacrawldata
> db.createCollection('webpages');
{ "ok" : 1 }
> show collections
webpages
> db.createCollection('websites');
{ "ok" : 1 }
>
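
The "use db ateacrawldata" error in the transcript above arises because the shell takes everything after "use" as the database name, here "db ateacrawldata", which contains a space. As a rough sketch only, the check below mirrors the database naming restrictions that the MongoDB manual lists for Unix/Linux deployments; isPlausibleDbName is a hypothetical helper, not a mongo shell function, and the shell's own validation may differ in detail.

```javascript
// Characters the MongoDB manual lists as forbidden in database names on
// Unix/Linux: / \ . space " $ and the null character.
const FORBIDDEN_DB_NAME_CHARS = /[\/\\. "$\0]/;

// Hypothetical helper (not part of the mongo shell): a name is plausible if
// it is non-empty, shorter than 64 characters, and has no forbidden character.
function isPlausibleDbName(name) {
  return (
    name.length > 0 &&
    name.length < 64 &&
    !FORBIDDEN_DB_NAME_CHARS.test(name)
  );
}

console.log(isPlausibleDbName("db ateacrawldata")); // false: contains a space
console.log(isPlausibleDbName("ateacrawldata"));    // true
```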

------------------------

Ask Clint to rename the "anupama" database to "ateacrawldata" following the instructions at:
    https://stackoverflow.com/questions/9201832/how-do-you-rename-a-mongodb-database
I don't have permissions to do this.
Nor do I have permissions to create Mongo collections within a new database that I create, like ateacrawldata.
I only seem to have rights to the "anupama" database.



-----------------------

MONGODB QUERIES:

db.getCollection('webpages').find({"isMRI": true, "singleSentences.langCode": "mri"})
db.getCollection('webpages').find({"singleSentences": { $elemMatch: {"langCode":"mri"} } }, {"singleSentences.$": "mri"})
db.getCollection('Webpages').find({"isMRI": true, "singleSentences": { $elemMatch: {"langCode":"eng"} } }, {"singleSentences.$": "eng"}) [gets the 1st English sentence of docs that are overall in MRI]
db.getCollection('Webpages').find({"containsMRI": true, "singleSentences": { $elemMatch: {"langCode":"mri"} } }, {"singleSentences.$": "mri"}) [gets the 1st MRI sentence of docs which have sentences containing MRI]
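
In the queries above, it is the $elemMatch condition that selects the array element, and the positional "singleSentences.$" projection returns only the first element that matched it. The MongoDB documentation writes the projection value as 1; the strings "mri" and "eng" used here appear to act simply as "include" on the 4.0 server. The sketch below emulates that behaviour for a single in-memory document in plain JavaScript, so it can be tried without a server. The sample page and its "sentence" field are made up for illustration, and firstMatchingSentence is a hypothetical helper.

```javascript
// Hypothetical sample document. Only isMRI, containsMRI, singleSentences and
// langCode appear in the queries above; the "sentence" field is an assumption.
const page = {
  isMRI: false,
  containsMRI: true,
  singleSentences: [
    { langCode: "eng", sentence: "Welcome." },
    { langCode: "mri", sentence: "Nau mai." },
    { langCode: "mri", sentence: "Haere mai." },
  ],
};

// Emulates, for one document,
//   find({singleSentences: {$elemMatch: {langCode}}}, {"singleSentences.$": 1})
// Returns null when no element matches, otherwise an object holding only the
// first matching array element. (A real find() would also return _id.)
function firstMatchingSentence(doc, langCode) {
  const match = doc.singleSentences.find((s) => s.langCode === langCode);
  return match ? { singleSentences: [match] } : null;
}

console.log(JSON.stringify(firstMatchingSentence(page, "mri")));
console.log(firstMatchingSentence(page, "fra")); // null: no such sentence
```

Only one sentence comes back per document even when several match, which is why these queries are useful for sampling but not for listing every MRI sentence on a page.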


READING

mongodb java convert class
https://www.quora.com/What-are-the-ways-of-converting-a-Java-object-to-a-MongoDB-document-and-vice-versa
https://stackoverflow.com/questions/39320825/pojo-to-org-bson-document-and-vice-versa
X https://mongodb.github.io/morphia/
https://stackoverflow.com/questions/10170506/inserting-java-object-to-mongodb-collection-using-java
X https://www.google.com/search?q=morphia+example&oq=morphia+example&aqs=chrome.0.0l6.4223j0j9&sourceid=chrome&ie=UTF-8
https://www.baeldung.com/mongodb-morphia
X https://web.archive.org/web/20171117121335/http://mongodb.github.io/morphia/1.3/getting-started/
=> https://morphia.dev/1.4/getting-started/quick-tour/
https://github.com/MorphiaOrg/morphia/tree/master/docs/reference


mongodb querying
https://docs.mongodb.com/manual/tutorial/query-embedded-documents/
https://docs.mongodb.com/manual/tutorial/query-arrays/
https://www.google.com/search?q=mongodb+find+subdocument&oq=mongodb+find+&aqs=chrome.0.69i59j69i57j0l4.7607j1j8&sourceid=chrome&ie=UTF-8
https://stackoverflow.com/questions/25586901/how-to-find-document-and-single-subdocument-matching-given-criterias-in-mongodb
https://stackoverflow.com/questions/21113543/mongodb-get-subdocument
https://stackoverflow.com/questions/36948856/find-subdocuments-in-mongo
https://docs.mongodb.com/v3.0/reference/operator/projection/positional/#proj._S_
https://www.google.com/search?q=mongodb+query+tutorial&oq=mongodb+query+tutorial&aqs=chrome..69i57j0l2j69i60l3.4719j0j7&sourceid=chrome&ie=UTF-8
https://blog.exploratory.io/an-introduction-to-mongodb-query-for-beginners-bd463319aa4c
https://docs.mongodb.com/manual/reference/method/db.collection.find/
https://docs.mongodb.com/manual/reference/method/db.collection.find/#find-projection


-------------------

Some queries with results:

# Num websites
db.getCollection('Websites').find({}).count()
1446

# Num webpages
db.getCollection('Webpages').find({}).count()
75139

# Find number of websites that have 1 or more pages in Maori (a positive numPagesInMRI)
db.getCollection('Websites').find({numPagesInMRI: { $gt: 0}}).count()
361

# Find number of webpages that are deemed to be overall in MRI (pages where isMRI=true)
db.getCollection('Webpages').find({isMRI:true}).count()
X5224
5215

# Number of pages that contain any number of MRI sentences
db.getCollection('Webpages').find({containsMRI: true}).count()
12858

# Number of sites with URLs containing /mi(/)
db.getCollection('Websites').find({urlContainsLangCodeInpath:true}).count()
153

# Number of websites outside NZ that contain /mi(/) in any of their sub-URLs
db.getCollection('Websites').find({urlContainsLangCodeInpath:true, geoLocationCountryCode: {$ne : "NZ"} }).count()
148

# Number of sites with URLs containing /mi(/) that are in NZ
db.getCollection('Websites').find({urlContainsLangCodeInpath:true, geoLocationCountryCode: "NZ"}).count()
5
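
As a quick sanity check on the counts above, the figures can be cross-checked and turned into percentages. This is only an illustrative sketch in plain JavaScript: the property names in the counts object and the percent helper are made up here, while the numbers are copied from the results above. One caveat worth keeping in mind is that $ne also matches documents that lack the field altogether, so the 148 "outside NZ" sites could include sites with no geoLocationCountryCode recorded.

```javascript
// Counts copied from the query results above; property names are illustrative.
const counts = {
  websites: 1446,
  webpages: 75139,
  sitesWithMRIPages: 361,
  pagesInMRI: 5215,
  pagesContainingMRI: 12858,
  sitesWithMiInPath: 153,
  miSitesOutsideNZ: 148,
  miSitesInNZ: 5,
};

// Percentage rounded to one decimal place.
function percent(part, whole) {
  return Math.round((part / whole) * 1000) / 10;
}

// The NZ and non-NZ counts should add up to all sites with /mi(/) in the path.
const miCountsAddUp =
  counts.miSitesOutsideNZ + counts.miSitesInNZ === counts.sitesWithMiInPath;

console.log(miCountsAddUp);                                       // true
console.log(percent(counts.sitesWithMRIPages, counts.websites));  // 25
console.log(percent(counts.pagesInMRI, counts.webpages));         // 6.9
console.log(percent(counts.pagesContainingMRI, counts.webpages)); // 17.1
```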

# sort websites that contain /mi(/) in path by geoLocationCountryCode
#    https://www.quackit.com/mongodb/tutorial/mongodb_sort_query_results.cfm
db.getCollection('Websites').find({urlContainsLangCodeInpath:true}).sort({geoLocationCountryCode: 1})
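
Sorting lists every matching site in country order. If only a per-country tally is wanted, one possible alternative is an aggregation pipeline. The sketch below merely builds the pipeline as plain JavaScript objects so it can be inspected without a server; it was not run against the crawl data, and the output field name numSites is an assumption chosen for illustration.

```javascript
// Pipeline stages are plain objects. Field names come from the queries above;
// "numSites" is a made-up name for the per-country count.
const sitesPerCountryPipeline = [
  // Keep only sites whose URLs contain /mi(/) in the path.
  { $match: { urlContainsLangCodeInpath: true } },
  // One output document per country code, counting the sites in each.
  { $group: { _id: "$geoLocationCountryCode", numSites: { $sum: 1 } } },
  // Largest groups first; ties broken by country code.
  { $sort: { numSites: -1, _id: 1 } },
];

// In the mongo shell or Robo 3T this would be run as:
//   db.getCollection('Websites').aggregate(sitesPerCountryPipeline)
console.log(JSON.stringify(sitesPerCountryPipeline, null, 2));
```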