https://codereview.stackexchange.com/questions/198343/crawl-and-gather-all-the-urls-recursively-in-a-domain
http://lucene.472066.n3.nabble.com/Using-nutch-just-for-the-crawler-fetcher-td611918.html

https://www.quora.com/What-are-some-Web-crawler-tips-to-avoid-crawler-traps

https://cwiki.apache.org/confluence/display/nutch/
https://cwiki.apache.org/confluence/display/NUTCH/Nutch2Crawling
https://cwiki.apache.org/confluence/display/nutch/ReaddbOptions

https://moz.com/top500
-----------
NUTCH
-----------
https://stackoverflow.com/questions/35449673/nutch-and-solr-indexing-blacklist-domain
https://nutch.apache.org/apidocs/apidocs-1.6/org/apache/nutch/urlfilter/domainblacklist/DomainBlacklistURLFilter.html

https://lucene.472066.n3.nabble.com/blacklist-for-crawling-td618343.html
https://lucene.472066.n3.nabble.com/Content-of-size-X-was-truncated-to-Y-td4003517.html

Google: nutch mirror web site
https://stackoverflow.com/questions/33354460/nutch-clone-website
[https://stackoverflow.com/questions/35714897/nutch-not-crawling-entire-website
fetch -all seems to be a nutch v2 thing?]

Google (30 Sep): site mirroring with nutch
https://grokbase.com/t/nutch/user/125sfbg0pt/using-nutch-for-web-site-mirroring
https://lucene.472066.n3.nabble.com/Using-nutch-just-for-the-crawler-fetcher-td611918.html
http://www.cs.ucy.ac.cy/courses/EPL660/lectures/lab6.pdf
slide p.5 onwards

crawler software options: https://repositorio.iscte-iul.pt/bitstream/10071/2871/1/Building%20a%20Scalable%20Index%20and%20Web%20Search%20Engine%20for%20Music%20on.pdf
See also p.20 on HTTrack.


Google: nutch performance tuning
* https://stackoverflow.com/questions/24383212/apache-nutch-performance-tuning-for-whole-web-crawling
* https://stackoverflow.com/questions/4871972/how-to-speed-up-crawling-in-nutch
* https://cwiki.apache.org/confluence/display/nutch/OptimizingCrawls

NUTCH INSTALLATION:
* Nutch v1: https://cwiki.apache.org/confluence/display/nutch/NutchTutorial#NutchTutorial-SetupSolrforsearch

Nutch v2 installation and set up:
* https://cwiki.apache.org/confluence/display/NUTCH/Nutch2Tutorial
* https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781783286850/1/ch01lvl1sec09/installing-and-configuring-apache-nutch


Nutch doesn't work with Spark (yet):
https://stackoverflow.com/questions/29950299/distributed-web-crawling-using-apache-spark-is-it-possible

SOLR:
* Query syntax: http://www.solrtutorial.com/solr-query-syntax.html
* Deleting a core: https://factorpad.com/tech/solr/reference/solr-delete.html


* If you change a Nutch 2 configuration, https://stackoverflow.com/questions/16401667/java-lang-classnotfoundexception-org-apache-gora-hbase-store-hbasestore
explains you can rebuild Nutch with:
cd <apache-nutch>
ant clean
ant runtime
----------------------------------
Apache Nutch 2 with newer HBase

hbase-common-1.4.8.jar

1. The HBase jar files need to go into runtime/local/lib.

But not slf4j-log4j12-1.7.10.jar (there's already a slf4j-log4j12-1.7.5.jar), so remove that one from runtime/local/lib after copying the jars over.

2. https://stackoverflow.com/questions/46340416/how-to-compile-nutch-2-3-1-with-hbase-1-2-6
https://stackoverflow.com/questions/39834423/apache-nutch-fetcherjob-throws-nosuchelementexception-deep-in-gora/39837926#39837926

Unfortunately, the page https://paste.apache.org/jjqz referred to above, which contained patches for using Gora 0.7, is no longer available.

http://mail-archives.apache.org/mod_mbox/nutch-user/201602.mbox/%[email protected]%3E

https://www.mail-archive.com/[email protected]/msg14245.html

------------------------------------------------------------------------------
Other way: Nutch on its own vagrant with specified hbase, or nutch with mongodb
------------------------------------------------------------------------------
* https://lobster1234.github.io/2017/08/14/search-with-nutch-mongodb-solr/
* https://waue0920.wordpress.com/2016/08/25/nutch-2-3-1-hbase-0-98-hadoop-2-5-solr-4-10-3/

The older but recommended HBase 0.98.21 for Hadoop 2 can be downloaded from https://archive.apache.org/dist/hbase/0.98.21/

-----
HBASE commands
/usr/local/hbase/bin/hbase shell
https://learnhbase.net/2013/03/02/hbase-shell-commands/
http://dwgeek.com/read-hbase-table-using-hbase-shell-get-command.html/
dropping tables: https://www.tutorialspoint.com/hbase/hbase_drop_table.htm

> list

davidbHomePage_webpage is a table

> get 'davidbHomePage_webpage', '1'

Solution to get a working Nutch 2:
get http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/hdfs-cc-work/vagrant-for-nutch2.tar.gz
and follow the instructions in my README file in there.

---------------------------------------------------------------------
ALTERNATIVES TO NUTCH - looking for site mirroring capabilities
---------------------------------------------------------------------
=> https://anarc.at/services/archive/web/
Autistici's crawl [https://git.autistici.org/ale/crawl] needs Go:
https://medium.com/better-programming/install-go-1-11-on-ubuntu-18-04-16-04-lts-8c098c503c5f
https://guide.freecodecamp.org/go/installing-go/ubuntu-apt-get/
To uninstall: https://medium.com/@firebitsbr/how-to-uninstall-from-the-apt-manager-uninstall-just-golang-go-from-universe-debian-ubuntu-82d6a3692cbd
https://tecadmin.net/install-go-on-ubuntu/ [our vagrant VMs are Ubuntu 16.04 LTS, as discovered by running the cmd "lsb_release -a"]
https://alternativeto.net/software/apache-nutch/
https://alternativeto.net/software/wget/
https://github.com/ArchiveTeam/grab-site/blob/master/README.md#inspecting-warc-files-in-the-terminal
https://github.com/ArchiveTeam/wpull

-------------------
Running nutch 2.x
-------------------

LINKS

https://lucene.472066.n3.nabble.com/Nutch-2-x-readdb-command-dump-td4033937.html
https://cwiki.apache.org/confluence/display/nutch/ReaddbOptions


https://lobster1234.github.io/2017/08/14/search-with-nutch-mongodb-solr/ ## most useful for running nutch 2.x crawls

https://www.mobomo.com/2017/06/the-basics-working-with-nutch-2-x/
"Fetch

This is where the magic happens. During the fetch step, Nutch crawls the urls selected in the generate step. The most important argument you need is -threads: this sets the number of fetcher threads per task. Increasing this will make crawling faster, but setting it too high can overwhelm a site and it might shut out your crawler, as well as take up too much memory from your machine. Run it like this:
$ nutch fetch -threads 50"


https://examples.javacodegeeks.com/enterprise-java/apache-hadoop/apache-hadoop-nutch-tutorial/
https://www.yegor256.com/2019/04/17/nutch-from-java.html

http://nutch.sourceforge.net/docs/en/tutorial.html
Intranet: Configuration
To configure things for intranet crawling you must:

Create a flat file of root urls. For example, to crawl the nutch.org site you might start with a file named urls containing just the Nutch home page. All other Nutch pages should be reachable from this page. The urls file would thus look like:

http://www.nutch.org/

Edit the file conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME with the name of the domain you wish to crawl. For example, if you wished to limit the crawl to the nutch.org domain, the line should read:

+^http://([a-z0-9]*\.)*nutch.org/

This will include any url in the domain nutch.org.

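The accept pattern above (minus the leading "+", which just marks it as an accept rule) can be tried out locally with grep's extended-regex mode; a quick sketch (note the tutorial's pattern leaves the dot in nutch.org unescaped, so it would also match e.g. nutchXorg — the dot is escaped here):

```shell
# Test candidate URLs against the intranet urlfilter regex using grep -E.
pattern='^http://([a-z0-9]*\.)*nutch\.org/'

matches() {
    printf '%s\n' "$1" | grep -Eq "$pattern"
}

matches "http://www.nutch.org/docs/" && echo "accepted"    # prints "accepted"
matches "http://example.com/nutch.org/" || echo "rejected" # prints "rejected"
```

This is only a convenience for eyeballing the regex; Nutch applies its own filter chain, which the URLFilterChecker tool further down exercises properly.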
Intranet: Running the Crawl
Once things are configured, running the crawl is easy. Just use the crawl command. Its options include:

-dir dir names the directory to put the crawl in.
-depth depth indicates the link depth from the root page that should be crawled.
-delay delay determines the number of seconds between accesses to each host.
-threads threads determines the number of threads that will fetch in parallel.

For example, a typical call might be:

bin/nutch crawl urls -dir crawl.test -depth 3 >& crawl.log

Typically one starts testing one's configuration by crawling at low depths, and watching the output to check that desired pages are found. Once one is more confident of the configuration, then an appropriate depth for a full crawl is around 10. <===========

Once crawling has completed, one can skip to the Searching section below.

-----------------------------------
Actually running nutch 2.x - steps
-----------------------------------
MANUALLY GOING THROUGH THE CYCLE 3 TIMES:

cd ~/apache-nutch-2.3.1/runtime/local

./bin/nutch inject urls

./bin/nutch generate -topN 50
./bin/nutch fetch -all
./bin/nutch parse -all
./bin/nutch updatedb -all

./bin/nutch generate -topN 50
./bin/nutch fetch -all
./bin/nutch parse -all
./bin/nutch updatedb -all

./bin/nutch generate -topN 50
./bin/nutch fetch -all
./bin/nutch parse -all
./bin/nutch updatedb -all

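The three repeated blocks above are the same generate/fetch/parse/updatedb round each time, so they can be wrapped in a loop. A minimal sketch — NUTCH defaults to echo here so the commands are only printed; set NUTCH=./bin/nutch inside runtime/local to actually run them:

```shell
# Repeat the generate/fetch/parse/updatedb round N times.
# NUTCH defaults to 'echo ./bin/nutch' (dry run: prints the commands only);
# export NUTCH=./bin/nutch to really execute them.
NUTCH="${NUTCH:-echo ./bin/nutch}"

run_rounds() {
    rounds="$1"
    i=1
    while [ "$i" -le "$rounds" ]; do
        $NUTCH generate -topN 50
        $NUTCH fetch -all
        $NUTCH parse -all
        $NUTCH updatedb -all
        i=$((i + 1))
    done
}

run_rounds 3
```

In practice the bundled ./bin/crawl script (used below) does this cycling for you; the loop is just to make the structure of the manual steps explicit.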
Dump output on local filesystem:
rm -rf /tmp/bla
./bin/nutch readdb -dump /tmp/bla [-crawlId ID -text]
less /tmp/bla/part-r-00000

To dump output on HDFS:
Need the hdfs host name if sending/dumping nutch crawl output to a location on hdfs.
The host is defined in /usr/local/hadoop/etc/hadoop/core-site.xml for property fs.defaultFS (https://stackoverflow.com/questions/27956973/java-io-ioexception-incomplete-hdfs-uri-no-host);
the host is hdfs://node2/ in this case.
So:

hdfs dfs -rmdir /user/vagrant/dump
XXX ./bin/nutch readdb -dump user/vagrant/dump -text ### won't work
XXX ./bin/nutch readdb -dump hdfs:///user/vagrant/dump -text ### won't work
./bin/nutch readdb -dump hdfs://node2/user/vagrant/dump -text


USING THE SCRIPT TO ATTEMPT TO CRAWL A SITE
* Choosing to repeat the cycle 10 times because, as per http://nutch.sourceforge.net/docs/en/tutorial.html

"Typically one starts testing one's configuration by crawling at low depths, and watching the output to check that desired pages are found. Once one is more confident of the configuration, then an appropriate depth for a full crawl is around 10."

* Use the ./bin/crawl script; provide the seed urls dir, the crawlId and the number of times to repeat = 10
vagrant@node2:~/apache-nutch-2.3.1/runtime/local$ ./bin/crawl urls davidbHomePage 10


* View the downloaded crawls.
This time need to provide the crawlId to readdb, in order to get a dump of its text contents:
hdfs dfs -rm -r hdfs://node2/user/vagrant/dump2
./bin/nutch readdb -dump hdfs://node2/user/vagrant/dump2 -text -crawlId davidbHomePage

* View the contents:
hdfs dfs -cat hdfs://node2/user/vagrant/dump2/part-r-*


* FIND OUT NUMBER OF URLS DOWNLOADED FOR THE SITE:
vagrant@node2:~/apache-nutch-2.3.1/runtime/local$ ./bin/nutch readdb -stats -crawlId davidbHomePage
WebTable statistics start
Statistics for WebTable:
retry 0:    44
status 5 (status_redir_perm):    4
status 3 (status_gone):    1
status 2 (status_fetched):    39
jobs: {[davidbHomePage]db_stats-job_local647846559_0001={jobName=[davidbHomePage]db_stats, jobID=job_local647846559_0001, counters={Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=135, REDUCE_INPUT_RECORDS=8, SPILLED_RECORDS=16, MERGED_MAP_OUTPUTS=1, VIRTUAL_MEMORY_BYTES=0, MAP_INPUT_RECORDS=44, SPLIT_RAW_BYTES=935, FAILED_SHUFFLE=0, MAP_OUTPUT_BYTES=2332, REDUCE_SHUFFLE_BYTES=135, PHYSICAL_MEMORY_BYTES=0, GC_TIME_MILLIS=0, REDUCE_INPUT_GROUPS=8, COMBINE_OUTPUT_RECORDS=8, SHUFFLED_MAPS=1, REDUCE_OUTPUT_RECORDS=8, MAP_OUTPUT_RECORDS=176, COMBINE_INPUT_RECORDS=176, CPU_MILLISECONDS=0, COMMITTED_HEAP_BYTES=595591168}, File Input Format Counters ={BYTES_READ=0}, File System Counters={FILE_LARGE_READ_OPS=0, FILE_WRITE_OPS=0, FILE_READ_OPS=0, FILE_BYTES_WRITTEN=1788140, FILE_BYTES_READ=1223290}, File Output Format Counters ={BYTES_WRITTEN=275}, Shuffle Errors={CONNECTION=0, WRONG_LENGTH=0, BAD_ID=0, WRONG_MAP=0, WRONG_REDUCE=0, IO_ERROR=0}}}}
TOTAL urls:    44
max score:    1.0
avg score:    0.022727273
min score:    0.0
WebTable statistics: done

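If a readdb -stats dump like the one above has been saved to a file, the headline numbers can be pulled out with awk. A sketch over the output format shown above (the whitespace after the colon may vary between Nutch versions):

```shell
# Extract the 'TOTAL urls' count from saved 'readdb -stats' output.
# The sample here is a shortened copy of the stats block above.
stats='retry 0:    44
status 2 (status_fetched):    39
TOTAL urls:    44
max score:    1.0'

total_urls=$(printf '%s\n' "$stats" | awk -F':[ ]*' '/^TOTAL urls/ {print $2}')
echo "$total_urls"   # prints 44
```

Useful when comparing URL counts across several crawlIds without re-reading the whole counters blob.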
------------------------------------
STOPPING CONDITION
Seems built in.
* When I tell it to cycle 15 times, it stops after 6 cycles saying there are no more URLs to fetch:

vagrant@node2:~/apache-nutch-2.3.1/runtime/local$ ./bin/crawl urls davidbHomePage2 15
---
No SOLRURL specified. Skipping indexing.
Injecting seed URLs

...

Thu Oct 3 09:22:23 UTC 2019 : Iteration 6 of 15
Generating batchId
Generating a new fetchlist
...
Generating batchId
Generating a new fetchlist
/home/vagrant/apache-nutch-2.3.1/runtime/local/bin/nutch generate -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -topN 50000 -noNorm -noFilter -adddays 0 -crawlId davidbHomePage2 -batchId 1570094569-27637
GeneratorJob: starting at 2019-10-03 09:22:49
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: false
GeneratorJob: normalizing: false
GeneratorJob: topN: 50000
GeneratorJob: finished at 2019-10-03 09:22:52, time elapsed: 00:00:02
GeneratorJob: generated batch id: 1570094569-27637 containing 0 URLs
Generate returned 1 (no new segments created)
Escaping loop: no more URLs to fetch now
vagrant@node2:~/apache-nutch-2.3.1/runtime/local$
---

* Running readdb -stats shows 44 URLs fetched, just as the first time (when the crawlId had been "davidbHomePage"):

vagrant@node2:~/apache-nutch-2.3.1/runtime/local$ ./bin/nutch readdb -stats -crawlId davidbHomePage2
---
WebTable statistics start
Statistics for WebTable:
retry 0:    44
status 5 (status_redir_perm):    4
status 3 (status_gone):    1
status 2 (status_fetched):    39
jobs: {[davidbHomePage2]db_stats-job_local985519583_0001={jobName=[davidbHomePage2]db_stats, jobID=job_local985519583_0001, counters={Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=135, REDUCE_INPUT_RECORDS=8, SPILLED_RECORDS=16, MERGED_MAP_OUTPUTS=1, VIRTUAL_MEMORY_BYTES=0, MAP_INPUT_RECORDS=44, SPLIT_RAW_BYTES=935, FAILED_SHUFFLE=0, MAP_OUTPUT_BYTES=2332, REDUCE_SHUFFLE_BYTES=135, PHYSICAL_MEMORY_BYTES=0, GC_TIME_MILLIS=4, REDUCE_INPUT_GROUPS=8, COMBINE_OUTPUT_RECORDS=8, SHUFFLED_MAPS=1, REDUCE_OUTPUT_RECORDS=8, MAP_OUTPUT_RECORDS=176, COMBINE_INPUT_RECORDS=176, CPU_MILLISECONDS=0, COMMITTED_HEAP_BYTES=552599552}, File Input Format Counters ={BYTES_READ=0}, File System Counters={FILE_LARGE_READ_OPS=0, FILE_WRITE_OPS=0, FILE_READ_OPS=0, FILE_BYTES_WRITTEN=1788152, FILE_BYTES_READ=1223290}, File Output Format Counters ={BYTES_WRITTEN=275}, Shuffle Errors={CONNECTION=0, WRONG_LENGTH=0, BAD_ID=0, WRONG_MAP=0, WRONG_REDUCE=0, IO_ERROR=0}}}}
TOTAL urls:    44
---

----------------------------------------------------------------------
Testing URLFilters: testing a URL to see if it's accepted
----------------------------------------------------------------------
Use the command
./bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
(mentioned at https://lucene.472066.n3.nabble.com/Correct-syntax-for-regex-urlfilter-txt-trying-to-exclude-single-path-results-td3600376.html)

Use as follows:

cd apache-nutch-2.3.1/runtime/local

./bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined

Then paste the URL you want to test and press Enter.
A + in front of the response means accepted;
a - in front of the response means rejected.
You can continue pasting URLs to test against the filters until you send Ctrl-D to terminate input.



-------------------
Dr Nichols's suggestion: can store a listing of potential product sites to inspect, by checking the url for /mi in combination with whether the domain's IP geolocates to OUTSIDE New Zealand (tld .nz).
* https://stackoverflow.com/questions/1415851/best-way-to-get-geo-location-in-java
  - https://mvnrepository.com/artifact/com.maxmind.geoip/geoip-api/1.2.10
  - the older .dat.gz file is archived at https://web.archive.org/web/20180917084618/http://geolite.maxmind.com/download/geoip/database/GeoLiteCity.dat.gz
  - and newer geo country data at https://dev.maxmind.com/geoip/geoip2/geolite2/
* https://dev.maxmind.com/geoip/geoip2/geolite2/
* older GeoIp API (has LookupService): https://github.com/maxmind/geoip-api-java
* newer GeoIp2 API: https://dev.maxmind.com/geoip/geoip2/downloadable/#MaxMind_APIs
  and https://maxmind.github.io/GeoIP2-java/doc/v2.12.0/
* https://maxmind.github.io/GeoIP2-java/
* https://github.com/AtlasOfLivingAustralia/ala-hub/issues/11
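The URL-side half of that check (the /mi path test plus the .nz-TLD exclusion) can be sketched in shell; the IP-geolocation half would still need one of the MaxMind databases linked above, so it is left out here. The helper name and sample URLs are made up for illustration:

```shell
# Hypothetical helper: flag a URL as a candidate product site when its
# path contains an /mi segment AND its host is NOT under the .nz TLD.
# (The real check would additionally geolocate the host's IP via MaxMind.)
is_candidate() {
    url="$1"
    # crude host extraction: strip the scheme, keep up to the first '/'
    host=$(printf '%s\n' "$url" | sed -E 's#^[a-z]+://([^/]+).*#\1#')
    case "$host" in
        *.nz) return 1 ;;         # .nz TLD: inside New Zealand, skip
    esac
    case "$url" in
        */mi/*|*/mi) return 0 ;;  # has an /mi path segment
    esac
    return 1
}

is_candidate "https://example.com/mi/products" && echo "inspect"   # prints "inspect"
is_candidate "https://example.co.nz/mi/products" || echo "skip"    # prints "skip"
```

This could run as a cheap pre-filter over a readdb dump before doing the slower per-host geolocation lookups.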


---
https://check-host.net/ip-info
https://ipinfo.info/html/ip_checker.php


----------
MongoDB
Installation:
https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/
https://docs.mongodb.com/manual/administration/install-on-linux/
https://hevodata.com/blog/install-mongodb-on-ubuntu/
https://www.digitalocean.com/community/tutorials/how-to-install-mongodb-on-ubuntu-16-04
CENTOS (Analytics): https://tecadmin.net/install-mongodb-on-centos/
FROM SOURCE: https://github.com/mongodb/mongo/wiki/Build-Mongodb-From-Source
GUI:
https://robomongo.org/
Robomongo is Robo 3T now

https://www.tutorialspoint.com/mongodb/mongodb_java.htm
JAR FILE:
http://central.maven.org/maven2/org/mongodb/mongo-java-driver/
https://mongodb.github.io/mongo-java-driver/


INSTALLING THE MONGODB SERVER AND MONGO CLIENT ON LINUX
Need to have sudo and root powers.

https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/
http://www.programmersought.com/article/6500308940/

sudo apt-get install mongodb-clients
mongo 'mongodb://mongodb.cms.waikato.ac.nz:27017' -u anupama -p

Failed with
Error: HostAndPort: host is empty at src/mongo/shell/mongo.js:148
exception: connect failed

This is due to a version incompatibility between the client and the mongodb server.
The solution is to follow the instructions at http://www.programmersought.com/article/6500308940/
and then https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/
as below:

sudo apt-get purge mongodb-clients
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 9DA31620334BD75D9DCB49F368818C72E52529D4
echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu xenial/mongodb-org/4.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.0.list
sudo apt-get update
sudo apt-get install mongodb-clients
mongo 'mongodb://mongodb.cms.waikato.ac.nz:27017' -u anupama -p
(still doesn't work)
sudo apt-get install -y mongodb-org
The above ensures an up-to-date mongo client but installs the mongodb server too. Maybe this is the only step needed to install an up-to-date mongo client and mongodb server?
sudo service mongod status

sudo service mongod start
"mongod" stands for mongo daemon. This runs the mongo db server, listening for client connections.
sudo service mongod status
sudo service mongod stop


RUNNING AND USING THE MONGO CLIENT SHELL:
Among the many things you can do with the Mongo client shell, one can use it to find the mongo client version (which is the version of the shell) and the mongo db version.

To run the mongo client shell WITHOUT loading a db:


wharariki:[880]/Scratch/ak19/gs3-extensions/maori-lang-detection>mongo --shell -nodb
MongoDB shell version: 2.6.10    <<<<<<<<<-------------------<<<< MONGO CLIENT VERSION

type "help" for help
> help
	db.help()                    help on db methods
	db.mycoll.help()             help on collection methods
	sh.help()                    sharding helpers
	rs.help()                    replica set helpers
	help admin                   administrative help
	help connect                 connecting to a db help
	help keys                    key shortcuts
	help misc                    misc things to know
	help mr                      mapreduce

	show dbs                     show database names
	show collections             show collections in current database
	show users                   show users in current database
	show profile                 show most recent system.profile entries with time >= 1ms
	show logs                    show the accessible logger names
	show log [name]              prints out the last segment of log in memory, 'global' is default
	use <db_name>                set current database
	db.foo.find()                list objects in collection foo
	db.foo.find( { a : 1 } )     list objects in foo where a == 1
	it                           result of the last line evaluated; use to further iterate
	DBQuery.shellBatchSize = x   set default number of items to display on shell
	exit                         quit the mongo shell

> help connect

Normally one specifies the server on the mongo shell command line. Run mongo --help to see those options.
Additional connections may be opened:

	var x = new Mongo('host[:port]');
	var mydb = x.getDB('mydb');
or
	var mydb = connect('host[:port]/mydb');

Note: the REPL prompt only auto-reports getLastError() for the shell command line connection.

Getting help on connect options:

> var x = new Mongo('mongodb.cms.waikato.ac.nz:27017');
> var mydb = x.getDB('anupama');

> mydb.connect.help()
DBCollection help
	db.connect.find().help() - show DBCursor help
	db.connect.count()
	db.connect.copyTo(newColl) - duplicates collection by copying all documents to newColl; no indexes are copied.
	db.connect.convertToCapped(maxBytes) - calls {convertToCapped:'connect', size:maxBytes}} command
	db.connect.dataSize()
	db.connect.distinct( key ) - e.g. db.connect.distinct( 'x' )
	db.connect.drop() drop the collection
	db.connect.dropIndex(index) - e.g. db.connect.dropIndex( "indexName" ) or db.connect.dropIndex( { "indexKey" : 1 } )
	db.connect.dropIndexes()
	db.connect.ensureIndex(keypattern[,options]) - options is an object with these possible fields: name, unique, dropDups
	db.connect.reIndex()
	db.connect.find([query],[fields]) - query is an optional query filter. fields is optional set of fields to return.
	                                    e.g. db.connect.find( {x:77} , {name:1, x:1} )
	db.connect.find(...).count()
	db.connect.find(...).limit(n)
	db.connect.find(...).skip(n)
	db.connect.find(...).sort(...)
	db.connect.findOne([query])
	db.connect.findAndModify( { update : ... , remove : bool [, query: {}, sort: {}, 'new': false] } )
	db.connect.getDB() get DB object associated with collection
	db.connect.getPlanCache() get query plan cache associated with collection
	db.connect.getIndexes()
	db.connect.group( { key : ..., initial: ..., reduce : ...[, cond: ...] } )
	db.connect.insert(obj)
	db.connect.mapReduce( mapFunction , reduceFunction , <optional params> )
	db.connect.aggregate( [pipeline], <optional params> ) - performs an aggregation on a collection; returns a cursor
	db.connect.remove(query)
	db.connect.renameCollection( newName , <dropTarget> ) renames the collection.
	db.connect.runCommand( name , <options> ) runs a db command with the given name where the first param is the collection name
	db.connect.save(obj)
	db.connect.stats()
	db.connect.storageSize() - includes free space allocated to this collection
	db.connect.totalIndexSize() - size in bytes of all the indexes
	db.connect.totalSize() - storage allocated for all data and indexes
	db.connect.update(query, object[, upsert_bool, multi_bool]) - instead of two flags, you can pass an object with fields: upsert, multi
	db.connect.validate( <full> ) - SLOW
	db.connect.getShardVersion() - only for use with sharding
	db.connect.getShardDistribution() - prints statistics about data distribution in the cluster
	db.connect.getSplitKeysForChunks( <maxChunkSize> ) - calculates split points over all chunks and returns splitter function
	db.connect.getWriteConcern() - returns the write concern used for any operations on this collection, inherited from server/db if set
	db.connect.setWriteConcern( <write concern doc> ) - sets the write concern for writes to the collection
	db.connect.unsetWriteConcern( <write concern doc> ) - unsets the write concern for writes to the collection
> mydb.version()
4.0.13    <<<<<<<<<-------------------<<<< MONGODB SERVER VERSION

(Check Mongo server version: https://stackoverflow.com/questions/38160412/how-to-find-the-exact-version-of-installed-mongodb)

Finally we now know the mongodb server version: 4.0.13.
This version didn't work with our mongo client (shell) version of 2.6.10, and that's why we had to upgrade the client.


INSTALLATION MONGO-DB AND CLIENT
FROM: https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/
wget -qO - https://www.mongodb.org/static/pgp/server-4.2.asc | sudo apt-key add -
echo "deb [ arch=amd64 ] https://repo.mongodb.org/apt/ubuntu xenial/mongodb-org/4.2 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.2.list
sudo apt-get update
sudo apt-get install -y mongodb-org

UNINSTALLING
https://www.anintegratedworld.com/uninstall-mongodb-in-ubuntu-via-command-line-in-3-easy-steps/


504 |
|
---|
505 | MONGO DB ROBO 3T
|
---|
506 | 1. Download "Double Pack" from https://robomongo.org/
|
---|
507 | 2. Untar its contents. Then untar the tarball in that.
|
---|
508 | 3. Run:
|
---|
509 | wharariki:[110]~/Downloads/robo3t-1.3.1-linux-x86_64-7419c406>./bin/robo3t
|
---|
510 |
|
---|