https://codereview.stackexchange.com/questions/198343/crawl-and-gather-all-the-urls-recursively-in-a-domain
http://lucene.472066.n3.nabble.com/Using-nutch-just-for-the-crawler-fetcher-td611918.html

https://www.quora.com/What-are-some-Web-crawler-tips-to-avoid-crawler-traps

https://cwiki.apache.org/confluence/display/nutch/
https://cwiki.apache.org/confluence/display/NUTCH/Nutch2Crawling
https://cwiki.apache.org/confluence/display/nutch/ReaddbOptions

https://moz.com/top500
-----------
NUTCH
-----------
https://stackoverflow.com/questions/35449673/nutch-and-solr-indexing-blacklist-domain
    https://nutch.apache.org/apidocs/apidocs-1.6/org/apache/nutch/urlfilter/domainblacklist/DomainBlacklistURLFilter.html

https://lucene.472066.n3.nabble.com/blacklist-for-crawling-td618343.html
https://lucene.472066.n3.nabble.com/Content-of-size-X-was-truncated-to-Y-td4003517.html

Google: nutch mirror web site
https://stackoverflow.com/questions/33354460/nutch-clone-website
[https://stackoverflow.com/questions/35714897/nutch-not-crawling-entire-website
 fetch -all seems to be a Nutch v2 thing?]

Google (30 Sep): site mirroring with nutch
https://grokbase.com/t/nutch/user/125sfbg0pt/using-nutch-for-web-site-mirroring
https://lucene.472066.n3.nabble.com/Using-nutch-just-for-the-crawler-fetcher-td611918.html
http://www.cs.ucy.ac.cy/courses/EPL660/lectures/lab6.pdf
    slide p.5 onwards

Crawler software options: https://repositorio.iscte-iul.pt/bitstream/10071/2871/1/Building%20a%20Scalable%20Index%20and%20Web%20Search%20Engine%20for%20Music%20on.pdf
    See also p.20 (HTTrack).

Google: nutch performance tuning
* https://stackoverflow.com/questions/24383212/apache-nutch-performance-tuning-for-whole-web-crawling
* https://stackoverflow.com/questions/4871972/how-to-speed-up-crawling-in-nutch
* https://cwiki.apache.org/confluence/display/nutch/OptimizingCrawls

NUTCH INSTALLATION:
* Nutch v1: https://cwiki.apache.org/confluence/display/nutch/NutchTutorial#NutchTutorial-SetupSolrforsearch

Nutch v2 installation and set-up:
* https://cwiki.apache.org/confluence/display/NUTCH/Nutch2Tutorial
* https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781783286850/1/ch01lvl1sec09/installing-and-configuring-apache-nutch


Nutch doesn't work with Spark (yet):
https://stackoverflow.com/questions/29950299/distributed-web-crawling-using-apache-spark-is-it-possible

SOLR:
* Query syntax: http://www.solrtutorial.com/solr-query-syntax.html
* Deleting a core: https://factorpad.com/tech/solr/reference/solr-delete.html


* If you change a Nutch 2 configuration, https://stackoverflow.com/questions/16401667/java-lang-classnotfoundexception-org-apache-gora-hbase-store-hbasestore
  explains that you can rebuild Nutch with:
    cd <apache-nutch>
    ant clean
    ant runtime
----------------------------------
Apache Nutch 2 with newer HBase

hbase-common-1.4.8.jar

1. The HBase jar files need to go into runtime/local/lib.

But not slf4j-log4j12-1.7.10.jar (there's already a slf4j-log4j12-1.7.5.jar), so remove that one from runtime/local/lib after copying the jars over.

2. https://stackoverflow.com/questions/46340416/how-to-compile-nutch-2-3-1-with-hbase-1-2-6
   https://stackoverflow.com/questions/39834423/apache-nutch-fetcherjob-throws-nosuchelementexception-deep-in-gora/39837926#39837926

Unfortunately, the page https://paste.apache.org/jjqz referred to above, which contained the patches for using Gora 0.7, is no longer available.

http://mail-archives.apache.org/mod_mbox/nutch-user/201602.mbox/%[email protected]%3E

https://www.mail-archive.com/[email protected]/msg14245.html

------------------------------------------------------------------------------
Other way: Nutch on its own vagrant VM with a specified HBase, or Nutch with MongoDB
------------------------------------------------------------------------------
* https://lobster1234.github.io/2017/08/14/search-with-nutch-mongodb-solr/
* https://waue0920.wordpress.com/2016/08/25/nutch-2-3-1-hbase-0-98-hadoop-2-5-solr-4-10-3/

The older but recommended HBase 0.98.21 for Hadoop 2 can be downloaded from https://archive.apache.org/dist/hbase/0.98.21/

-----
HBASE commands
/usr/local/hbase/bin/hbase shell
https://learnhbase.net/2013/03/02/hbase-shell-commands/
http://dwgeek.com/read-hbase-table-using-hbase-shell-get-command.html/
dropping tables: https://www.tutorialspoint.com/hbase/hbase_drop_table.htm

> list

davidbHomePage_webpage is a table

> get 'davidbHomePage_webpage', '1'

Solution to get a working Nutch 2:
Get http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/hdfs-cc-work/vagrant-for-nutch2.tar.gz
and follow the instructions in my README file in there.

---------------------------------------------------------------------
ALTERNATIVES TO NUTCH - looking for site mirroring capabilities
---------------------------------------------------------------------
=> https://anarc.at/services/archive/web/
   Autistici's crawl [https://git.autistici.org/ale/crawl] needs Go:
   https://medium.com/better-programming/install-go-1-11-on-ubuntu-18-04-16-04-lts-8c098c503c5f
   https://guide.freecodecamp.org/go/installing-go/ubuntu-apt-get/
   To uninstall: https://medium.com/@firebitsbr/how-to-uninstall-from-the-apt-manager-uninstall-just-golang-go-from-universe-debian-ubuntu-82d6a3692cbd
   https://tecadmin.net/install-go-on-ubuntu/ [our vagrant VMs are Ubuntu 16.04 LTS, as discovered by running the cmd "lsb_release -a"]
https://alternativeto.net/software/apache-nutch/
https://alternativeto.net/software/wget/
https://github.com/ArchiveTeam/grab-site/blob/master/README.md#inspecting-warc-files-in-the-terminal
https://github.com/ArchiveTeam/wpull

-------------------

Running nutch 2.x

-------------------

LINKS

https://lucene.472066.n3.nabble.com/Nutch-2-x-readdb-command-dump-td4033937.html
https://cwiki.apache.org/confluence/display/nutch/ReaddbOptions


https://lobster1234.github.io/2017/08/14/search-with-nutch-mongodb-solr/    ## most useful for running nutch 2.x crawls

https://www.mobomo.com/2017/06/the-basics-working-with-nutch-2-x/
    "Fetch

    This is where the magic happens. During the fetch step, Nutch crawls the urls selected in the generate step. The most important argument you need is -threads: this sets the number of fetcher threads per task. Increasing this will make crawling faster, but setting it too high can overwhelm a site and it might shut out your crawler, as well as take up too much memory from your machine. Run it like this:
    $ nutch fetch -threads 50"


https://examples.javacodegeeks.com/enterprise-java/apache-hadoop/apache-hadoop-nutch-tutorial/
https://www.yegor256.com/2019/04/17/nutch-from-java.html

http://nutch.sourceforge.net/docs/en/tutorial.html

Intranet: Configuration
To configure things for intranet crawling you must:

    Create a flat file of root urls. For example, to crawl the nutch.org site you might start with a file named urls containing just the Nutch home page. All other Nutch pages should be reachable from this page. The urls file would thus look like:

    http://www.nutch.org/

    Edit the file conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME with the name of the domain you wish to crawl. For example, if you wished to limit the crawl to the nutch.org domain, the line should read:

    +^http://([a-z0-9]*\.)*nutch.org/

    This will include any url in the domain nutch.org.

Intranet: Running the Crawl
Once things are configured, running the crawl is easy. Just use the crawl command. Its options include:

    -dir dir names the directory to put the crawl in.
    -depth depth indicates the link depth from the root page that should be crawled.
    -delay delay determines the number of seconds between accesses to each host.
    -threads threads determines the number of threads that will fetch in parallel.

For example, a typical call might be:

bin/nutch crawl urls -dir crawl.test -depth 3 >& crawl.log

Typically one starts testing one's configuration by crawling at low depths, and watching the output to check that desired pages are found. Once one is more confident of the configuration, then an appropriate depth for a full crawl is around 10. <===========

Once crawling has completed, one can skip to the Searching section below.


-----------------------------------
Actually running nutch 2.x - steps
-----------------------------------
MANUALLY GOING THROUGH THE CYCLE 3 TIMES:

cd ~/apache-nutch-2.3.1/runtime/local

./bin/nutch inject urls

./bin/nutch generate -topN 50
./bin/nutch fetch -all
./bin/nutch parse -all
./bin/nutch updatedb -all

./bin/nutch generate -topN 50
./bin/nutch fetch -all
./bin/nutch parse -all
./bin/nutch updatedb -all

./bin/nutch generate -topN 50
./bin/nutch fetch -all
./bin/nutch parse -all
./bin/nutch updatedb -all
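
Instead of pasting the four commands three times, the cycle can also be wrapped in a small shell loop; a minimal sketch, assuming the same runtime/local directory and seed urls dir as above:

    cd ~/apache-nutch-2.3.1/runtime/local
    ./bin/nutch inject urls
    # run the generate -> fetch -> parse -> updatedb cycle three times
    for i in 1 2 3; do
        ./bin/nutch generate -topN 50
        ./bin/nutch fetch -all
        ./bin/nutch parse -all
        ./bin/nutch updatedb -all
    done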

Dump output on the local filesystem:
    rm -rf /tmp/bla
    ./bin/nutch readdb -dump /tmp/bla [-crawlId ID -text]
    less /tmp/bla/part-r-00000

To dump output on HDFS:
    Need the hdfs host name if sending/dumping nutch crawl output to a location on hdfs.
    The host is defined in /usr/local/hadoop/etc/hadoop/core-site.xml for the property fs.defaultFS (https://stackoverflow.com/questions/27956973/java-io-ioexception-incomplete-hdfs-uri-no-host);
    the host is hdfs://node2/ in this case.
    So:

    hdfs dfs -rmdir /user/vagrant/dump
    XXX ./bin/nutch readdb -dump user/vagrant/dump -text             ### won't work
    XXX ./bin/nutch readdb -dump hdfs:///user/vagrant/dump -text     ### won't work
    ./bin/nutch readdb -dump hdfs://node2/user/vagrant/dump -text
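
A quick way to confirm the fs.defaultFS value without opening core-site.xml (assuming the hadoop client tools are on the PATH):

    hdfs getconf -confKey fs.defaultFS
    # should print something like hdfs://node2 on this setup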


USING THE SCRIPT TO ATTEMPT TO CRAWL A SITE
* Choosing to repeat the cycle 10 times because of the advice at http://nutch.sourceforge.net/docs/en/tutorial.html:

"Typically one starts testing one's configuration by crawling at low depths, and watching the output to check that desired pages are found. Once one is more confident of the configuration, then an appropriate depth for a full crawl is around 10."

* Use the ./bin/crawl script, providing the seed urls dir, the crawlId, and the number of times to repeat (10):
vagrant@node2:~/apache-nutch-2.3.1/runtime/local$ ./bin/crawl urls davidbHomePage 10


* View the downloaded crawls.
This time we need to provide the crawlId to readdb, in order to get a dump of its text contents:
    hdfs dfs -rm -r hdfs://node2/user/vagrant/dump2
    ./bin/nutch readdb -dump hdfs://node2/user/vagrant/dump2 -text -crawlId davidbHomePage

* View the contents:
hdfs dfs -cat hdfs://node2/user/vagrant/dump2/part-r-*


* FIND OUT NUMBER OF URLS DOWNLOADED FOR THE SITE:
vagrant@node2:~/apache-nutch-2.3.1/runtime/local$ ./bin/nutch readdb -stats -crawlId davidbHomePage
WebTable statistics start
Statistics for WebTable:
retry 0:        44
status 5 (status_redir_perm):   4
status 3 (status_gone): 1
status 2 (status_fetched):      39
jobs:   {[davidbHomePage]db_stats-job_local647846559_0001={jobName=[davidbHomePage]db_stats, jobID=job_local647846559_0001, counters={Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=135, REDUCE_INPUT_RECORDS=8, SPILLED_RECORDS=16, MERGED_MAP_OUTPUTS=1, VIRTUAL_MEMORY_BYTES=0, MAP_INPUT_RECORDS=44, SPLIT_RAW_BYTES=935, FAILED_SHUFFLE=0, MAP_OUTPUT_BYTES=2332, REDUCE_SHUFFLE_BYTES=135, PHYSICAL_MEMORY_BYTES=0, GC_TIME_MILLIS=0, REDUCE_INPUT_GROUPS=8, COMBINE_OUTPUT_RECORDS=8, SHUFFLED_MAPS=1, REDUCE_OUTPUT_RECORDS=8, MAP_OUTPUT_RECORDS=176, COMBINE_INPUT_RECORDS=176, CPU_MILLISECONDS=0, COMMITTED_HEAP_BYTES=595591168}, File Input Format Counters ={BYTES_READ=0}, File System Counters={FILE_LARGE_READ_OPS=0, FILE_WRITE_OPS=0, FILE_READ_OPS=0, FILE_BYTES_WRITTEN=1788140, FILE_BYTES_READ=1223290}, File Output Format Counters ={BYTES_WRITTEN=275}, Shuffle Errors={CONNECTION=0, WRONG_LENGTH=0, BAD_ID=0, WRONG_MAP=0, WRONG_REDUCE=0, IO_ERROR=0}}}}
TOTAL urls:     44
max score:      1.0
avg score:      0.022727273
min score:      0.0
WebTable statistics: done

------------------------------------
STOPPING CONDITION
Seems to be inbuilt.
* When I tell it to cycle 15 times, it stops after 6 cycles, saying there are no more URLs to fetch:

vagrant@node2:~/apache-nutch-2.3.1/runtime/local$ ./bin/crawl urls davidbHomePage2 15
---
No SOLRURL specified. Skipping indexing.
Injecting seed URLs

...

Thu Oct 3 09:22:23 UTC 2019 : Iteration 6 of 15
Generating batchId
Generating a new fetchlist
...
Generating batchId
Generating a new fetchlist
/home/vagrant/apache-nutch-2.3.1/runtime/local/bin/nutch generate -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -topN 50000 -noNorm -noFilter -adddays 0 -crawlId davidbHomePage2 -batchId 1570094569-27637
GeneratorJob: starting at 2019-10-03 09:22:49
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: false
GeneratorJob: normalizing: false
GeneratorJob: topN: 50000
GeneratorJob: finished at 2019-10-03 09:22:52, time elapsed: 00:00:02
GeneratorJob: generated batch id: 1570094569-27637 containing 0 URLs
Generate returned 1 (no new segments created)
Escaping loop: no more URLs to fetch now
vagrant@node2:~/apache-nutch-2.3.1/runtime/local$
---

* Running readdb -stats shows 44 URLs fetched, just as the first time (when the crawlId had been "davidbHomePage"):

vagrant@node2:~/apache-nutch-2.3.1/runtime/local$ ./bin/nutch readdb -stats -crawlId davidbHomePage2
---
WebTable statistics start
Statistics for WebTable:
retry 0:        44
status 5 (status_redir_perm):   4
status 3 (status_gone): 1
status 2 (status_fetched):      39
jobs:   {[davidbHomePage2]db_stats-job_local985519583_0001={jobName=[davidbHomePage2]db_stats, jobID=job_local985519583_0001, counters={Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=135, REDUCE_INPUT_RECORDS=8, SPILLED_RECORDS=16, MERGED_MAP_OUTPUTS=1, VIRTUAL_MEMORY_BYTES=0, MAP_INPUT_RECORDS=44, SPLIT_RAW_BYTES=935, FAILED_SHUFFLE=0, MAP_OUTPUT_BYTES=2332, REDUCE_SHUFFLE_BYTES=135, PHYSICAL_MEMORY_BYTES=0, GC_TIME_MILLIS=4, REDUCE_INPUT_GROUPS=8, COMBINE_OUTPUT_RECORDS=8, SHUFFLED_MAPS=1, REDUCE_OUTPUT_RECORDS=8, MAP_OUTPUT_RECORDS=176, COMBINE_INPUT_RECORDS=176, CPU_MILLISECONDS=0, COMMITTED_HEAP_BYTES=552599552}, File Input Format Counters ={BYTES_READ=0}, File System Counters={FILE_LARGE_READ_OPS=0, FILE_WRITE_OPS=0, FILE_READ_OPS=0, FILE_BYTES_WRITTEN=1788152, FILE_BYTES_READ=1223290}, File Output Format Counters ={BYTES_WRITTEN=275}, Shuffle Errors={CONNECTION=0, WRONG_LENGTH=0, BAD_ID=0, WRONG_MAP=0, WRONG_REDUCE=0, IO_ERROR=0}}}}
TOTAL urls:     44
---

----------------------------------------------------------------------
  Testing URLFilters: testing a URL to see if it's accepted
----------------------------------------------------------------------
Use the command
    ./bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
(mentioned at https://lucene.472066.n3.nabble.com/Correct-syntax-for-regex-urlfilter-txt-trying-to-exclude-single-path-results-td3600376.html)

Use as follows:

    cd apache-nutch-2.3.1/runtime/local

    ./bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined

Then paste the URL you want to test and press Enter.
    A + in front of the response means accepted.
    A - in front of the response means rejected.
You can continue pasting URLs to test against the filters until you send Ctrl-D to terminate input.
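
Since the checker reads URLs from stdin, piping input in should also work, which is handy for scripting; a minimal sketch (the URL is just a made-up example):

    echo "https://www.nutch.org/some/page.html" | ./bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined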




-------------------
Dr Nichols's suggestion: we can store a listing of potential product sites to inspect, by checking the URL for /mi in combination with whether the domain's IP geolocates to OUTSIDE New Zealand (tld nz). (A rough command-line sketch of this check follows the list of links below.)
* https://stackoverflow.com/questions/1415851/best-way-to-get-geo-location-in-java
  - https://mvnrepository.com/artifact/com.maxmind.geoip/geoip-api/1.2.10
  - older .dat.gz file is archived at https://web.archive.org/web/20180917084618/http://geolite.maxmind.com/download/geoip/database/GeoLiteCity.dat.gz
  - and newer geo country data at https://dev.maxmind.com/geoip/geoip2/geolite2/
* https://dev.maxmind.com/geoip/geoip2/geolite2/
* older GeoIp API (has LookupService): https://github.com/maxmind/geoip-api-java
* Newer GeoIp2 API: https://dev.maxmind.com/geoip/geoip2/downloadable/#MaxMind_APIs
  and https://maxmind.github.io/GeoIP2-java/doc/v2.12.0/
* https://maxmind.github.io/GeoIP2-java/
* https://github.com/AtlasOfLivingAustralia/ala-hub/issues/11
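
A rough command-line illustration of the idea only (not the Java MaxMind API linked above): resolve the domain, geolocate the IP with geoiplookup (from the geoip-bin package, which uses the legacy GeoIP country database), and flag URLs that contain /mi but resolve outside NZ. The URL below is a made-up example.

    URL="https://example.com/mi/some-page"                                  # hypothetical URL to test
    HOST=$(echo "$URL" | sed -E 's|^https?://([^/]+).*|\1|')                # extract the domain
    IP=$(dig +short "$HOST" | head -n1)                                     # resolve it to an IP address
    COUNTRY=$(geoiplookup "$IP" | awk -F': ' '{print $2}' | cut -d',' -f1)  # two-letter country code
    if echo "$URL" | grep -q "/mi" && [ "$COUNTRY" != "NZ" ]; then
        echo "$URL ($HOST -> $IP, $COUNTRY): potential product site to inspect"
    fi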


---
https://check-host.net/ip-info
https://ipinfo.info/html/ip_checker.php


----------
MongoDB
Installation:
    https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/
    https://docs.mongodb.com/manual/administration/install-on-linux/
    https://hevodata.com/blog/install-mongodb-on-ubuntu/
    https://www.digitalocean.com/community/tutorials/how-to-install-mongodb-on-ubuntu-16-04
    CENTOS (Analytics): https://tecadmin.net/install-mongodb-on-centos/
    FROM SOURCE: https://github.com/mongodb/mongo/wiki/Build-Mongodb-From-Source
GUI:
    https://robomongo.org/
    Robomongo is now called Robo 3T.

https://www.tutorialspoint.com/mongodb/mongodb_java.htm
JAR FILE:
    http://central.maven.org/maven2/org/mongodb/mongo-java-driver/
    https://mongodb.github.io/mongo-java-driver/


INSTALLING THE MONGODB SERVER AND MONGO CLIENT ON LINUX
Need to have sudo and root powers.

https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/
http://www.programmersought.com/article/6500308940/

   52  sudo apt-get install mongodb-clients
   53  mongo 'mongodb://mongodb.cms.waikato.ac.nz:27017' -u anupama -p

Failed with
    Error: HostAndPort: host is empty at src/mongo/shell/mongo.js:148
    exception: connect failed

This is due to a version incompatibility between the mongo client and the mongodb server.
The solution is to follow the instructions at http://www.programmersought.com/article/6500308940/
and then https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/
as below:

   54  sudo apt-get purge mongodb-clients
   55  sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 9DA31620334BD75D9DCB49F368818C72E52529D4
   56  echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu xenial/mongodb-org/4.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.0.list
   57  sudo apt-get update
   58  sudo apt-get install mongodb-clients
   59  mongo 'mongodb://mongodb.cms.waikato.ac.nz:27017' -u anupama -p
(still doesn't work)
   60  sudo apt-get install -y mongodb-org
The above ensures an up-to-date mongo client but installs the mongodb server too. Maybe this is the only step that is needed to install an up-to-date mongo client and mongodb server?
   72  sudo service mongod status

  103  sudo service mongod start
"mongod" stands for mongo daemon. This runs the mongodb server, listening for client connections.
  104  sudo service mongod status
   88  sudo service mongod stop


RUNNING AND USING THE MONGO CLIENT SHELL:
Among the many things you can do with the Mongo client shell, you can use it to find the mongo client version (which is the version of the shell) and the mongodb server version.

To run the mongo client shell WITHOUT loading a db:

    wharariki:[880]/Scratch/ak19/gs3-extensions/maori-lang-detection>mongo --shell -nodb
    MongoDB shell version: 2.6.10    <<<<<<<<<-------------------<<<< MONGO CLIENT VERSION

    type "help" for help
    > help
            db.help()                    help on db methods
            db.mycoll.help()             help on collection methods
            sh.help()                    sharding helpers
            rs.help()                    replica set helpers
            help admin                   administrative help
            help connect                 connecting to a db help
            help keys                    key shortcuts
            help misc                    misc things to know
            help mr                      mapreduce

            show dbs                     show database names
            show collections             show collections in current database
            show users                   show users in current database
            show profile                 show most recent system.profile entries with time >= 1ms
            show logs                    show the accessible logger names
            show log [name]              prints out the last segment of log in memory, 'global' is default
            use <db_name>                set current database
            db.foo.find()                list objects in collection foo
            db.foo.find( { a : 1 } )     list objects in foo where a == 1
            it                           result of the last line evaluated; use to further iterate
            DBQuery.shellBatchSize = x   set default number of items to display on shell
            exit                         quit the mongo shell

    > help connect

    Normally one specifies the server on the mongo shell command line.  Run mongo --help to see those options.
    Additional connections may be opened:

        var x = new Mongo('host[:port]');
        var mydb = x.getDB('mydb');
    or
        var mydb = connect('host[:port]/mydb');

    Note: the REPL prompt only auto-reports getLastError() for the shell command line connection.

    Getting help on connect options:

    > var x = new Mongo('mongodb.cms.waikato.ac.nz:27017');
    > var mydb = x.getDB('anupama');

    > mydb.connect.help()
    DBCollection help
        db.connect.find().help() - show DBCursor help
        db.connect.count()
        db.connect.copyTo(newColl) - duplicates collection by copying all documents to newColl; no indexes are copied.
        db.connect.convertToCapped(maxBytes) - calls {convertToCapped:'connect', size:maxBytes}} command
        db.connect.dataSize()
        db.connect.distinct( key ) - e.g. db.connect.distinct( 'x' )
        db.connect.drop() drop the collection
        db.connect.dropIndex(index) - e.g. db.connect.dropIndex( "indexName" ) or db.connect.dropIndex( { "indexKey" : 1 } )
        db.connect.dropIndexes()
        db.connect.ensureIndex(keypattern[,options]) - options is an object with these possible fields: name, unique, dropDups
        db.connect.reIndex()
        db.connect.find([query],[fields]) - query is an optional query filter. fields is optional set of fields to return.
                                            e.g. db.connect.find( {x:77} , {name:1, x:1} )
        db.connect.find(...).count()
        db.connect.find(...).limit(n)
        db.connect.find(...).skip(n)
        db.connect.find(...).sort(...)
        db.connect.findOne([query])
        db.connect.findAndModify( { update : ... , remove : bool [, query: {}, sort: {}, 'new': false] } )
        db.connect.getDB() get DB object associated with collection
        db.connect.getPlanCache() get query plan cache associated with collection
        db.connect.getIndexes()
        db.connect.group( { key : ..., initial: ..., reduce : ...[, cond: ...] } )
        db.connect.insert(obj)
        db.connect.mapReduce( mapFunction , reduceFunction , <optional params> )
        db.connect.aggregate( [pipeline], <optional params> ) - performs an aggregation on a collection; returns a cursor
        db.connect.remove(query)
        db.connect.renameCollection( newName , <dropTarget> ) renames the collection.
        db.connect.runCommand( name , <options> ) runs a db command with the given name where the first param is the collection name
        db.connect.save(obj)
        db.connect.stats()
        db.connect.storageSize() - includes free space allocated to this collection
        db.connect.totalIndexSize() - size in bytes of all the indexes
        db.connect.totalSize() - storage allocated for all data and indexes
        db.connect.update(query, object[, upsert_bool, multi_bool]) - instead of two flags, you can pass an object with fields: upsert, multi
        db.connect.validate( <full> ) - SLOW
        db.connect.getShardVersion() - only for use with sharding
        db.connect.getShardDistribution() - prints statistics about data distribution in the cluster
        db.connect.getSplitKeysForChunks( <maxChunkSize> ) - calculates split points over all chunks and returns splitter function
        db.connect.getWriteConcern() - returns the write concern used for any operations on this collection, inherited from server/db if set
        db.connect.setWriteConcern( <write concern doc> ) - sets the write concern for writes to the collection
        db.connect.unsetWriteConcern( <write concern doc> ) - unsets the write concern for writes to the collection
    > mydb.version()
    4.0.13    <<<<<<<<<-------------------<<<< MONGODB SERVER VERSION

(Check Mongo server version: https://stackoverflow.com/questions/38160412/how-to-find-the-exact-version-of-installed-mongodb)

Finally we now know the mongodb server version: 4.0.13.
This version didn't work with our mongo client (shell) version of 2.6.10, and that's why we had to upgrade the client.


INSTALLING MONGODB AND THE MONGO CLIENT
FROM: https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/
    wget -qO - https://www.mongodb.org/static/pgp/server-4.2.asc | sudo apt-key add -
    echo "deb [ arch=amd64 ] https://repo.mongodb.org/apt/ubuntu xenial/mongodb-org/4.2 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.2.list
    sudo apt-get update
    sudo apt-get install -y mongodb-org
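
A quick post-install sanity check (assuming both the client and server packages installed cleanly): the versions reported below should now match, unlike the 2.6.10 client / 4.0.13 server mismatch above.

    mongo --version         # mongo shell (client) version
    mongod --version        # mongodb server version
    sudo service mongod status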

UNINSTALLING
    https://www.anintegratedworld.com/uninstall-mongodb-in-ubuntu-via-command-line-in-3-easy-steps/


MONGO DB ROBO 3T
1. Download the "Double Pack" from https://robomongo.org/
2. Untar its contents. Then untar the tarball inside that.
3. Run:
    wharariki:[110]~/Downloads/robo3t-1.3.1-linux-x86_64-7419c406>./bin/robo3t
