https://codereview.stackexchange.com/questions/198343/crawl-and-gather-all-the-urls-recursively-in-a-domain
http://lucene.472066.n3.nabble.com/Using-nutch-just-for-the-crawler-fetcher-td611918.html

https://www.quora.com/What-are-some-Web-crawler-tips-to-avoid-crawler-traps

https://cwiki.apache.org/confluence/display/nutch/
https://cwiki.apache.org/confluence/display/NUTCH/Nutch2Crawling
https://cwiki.apache.org/confluence/display/nutch/ReaddbOptions

https://moz.com/top500
-----------
NUTCH
-----------
https://stackoverflow.com/questions/35449673/nutch-and-solr-indexing-blacklist-domain
https://nutch.apache.org/apidocs/apidocs-1.6/org/apache/nutch/urlfilter/domainblacklist/DomainBlacklistURLFilter.html

https://lucene.472066.n3.nabble.com/blacklist-for-crawling-td618343.html
https://lucene.472066.n3.nabble.com/Content-of-size-X-was-truncated-to-Y-td4003517.html

Google: nutch mirror web site
https://stackoverflow.com/questions/33354460/nutch-clone-website
[https://stackoverflow.com/questions/35714897/nutch-not-crawling-entire-website
fetch -all seems to be a nutch v2 thing?]

Google (30 Sep): site mirroring with nutch
https://grokbase.com/t/nutch/user/125sfbg0pt/using-nutch-for-web-site-mirroring
https://lucene.472066.n3.nabble.com/Using-nutch-just-for-the-crawler-fetcher-td611918.html
http://www.cs.ucy.ac.cy/courses/EPL660/lectures/lab6.pdf
slide p.5 onwards

crawler software options: https://repositorio.iscte-iul.pt/bitstream/10071/2871/1/Building%20a%20Scalable%20Index%20and%20Web%20Search%20Engine%20for%20Music%20on.pdf
See also p.20 on HTTrack.


Google: nutch performance tuning
* https://stackoverflow.com/questions/24383212/apache-nutch-performance-tuning-for-whole-web-crawling
* https://stackoverflow.com/questions/4871972/how-to-speed-up-crawling-in-nutch
* https://cwiki.apache.org/confluence/display/nutch/OptimizingCrawls

NUTCH INSTALLATION:
* Nutch v1: https://cwiki.apache.org/confluence/display/nutch/NutchTutorial#NutchTutorial-SetupSolrforsearch

Nutch v2 installation and set up:
* https://cwiki.apache.org/confluence/display/NUTCH/Nutch2Tutorial
* https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781783286850/1/ch01lvl1sec09/installing-and-configuring-apache-nutch


Nutch doesn't work with Spark (yet):
https://stackoverflow.com/questions/29950299/distributed-web-crawling-using-apache-spark-is-it-possible

SOLR:
* Query syntax: http://www.solrtutorial.com/solr-query-syntax.html
* Deleting a core: https://factorpad.com/tech/solr/reference/solr-delete.html


* If you change a Nutch 2 configuration, https://stackoverflow.com/questions/16401667/java-lang-classnotfoundexception-org-apache-gora-hbase-store-hbasestore
explains you can rebuild Nutch with:
cd <apache-nutch>
ant clean
ant runtime
----------------------------------
Apache Nutch 2 with newer HBase

hbase-common-1.4.8.jar

1. The HBase jar files need to go into runtime/local/lib.

But not slf4j-log4j12-1.7.10.jar (there's already a slf4j-log4j12-1.7.5.jar), so remove that one from runtime/local/lib after copying the jars over.

2. https://stackoverflow.com/questions/46340416/how-to-compile-nutch-2-3-1-with-hbase-1-2-6
https://stackoverflow.com/questions/39834423/apache-nutch-fetcherjob-throws-nosuchelementexception-deep-in-gora/39837926#39837926

Unfortunately, the page https://paste.apache.org/jjqz referred to above, which contained patches for using Gora 0.7, is no longer available.

http://mail-archives.apache.org/mod_mbox/nutch-user/201602.mbox/%[email protected]%3E

https://www.mail-archive.com/[email protected]/msg14245.html

------------------------------------------------------------------------------
Other way: Nutch on its own vagrant with specified hbase, or nutch with mongodb
------------------------------------------------------------------------------
* https://lobster1234.github.io/2017/08/14/search-with-nutch-mongodb-solr/
* https://waue0920.wordpress.com/2016/08/25/nutch-2-3-1-hbase-0-98-hadoop-2-5-solr-4-10-3/

The older but recommended HBase 0.98.21 for Hadoop 2 can be downloaded from https://archive.apache.org/dist/hbase/0.98.21/

-----
HBASE commands
/usr/local/hbase/bin/hbase shell
https://learnhbase.net/2013/03/02/hbase-shell-commands/
http://dwgeek.com/read-hbase-table-using-hbase-shell-get-command.html/
dropping tables: https://www.tutorialspoint.com/hbase/hbase_drop_table.htm

> list

davidbHomePage_webpage is a table

> get 'davidbHomePage_webpage', '1'

Solution to get a working Nutch 2:
get http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/hdfs-cc-work/vagrant-for-nutch2.tar.gz
and follow the instructions in my README file in there.

---------------------------------------------------------------------
ALTERNATIVES TO NUTCH - looking for site mirroring capabilities
---------------------------------------------------------------------
=> https://anarc.at/services/archive/web/
Autistici's crawl [https://git.autistici.org/ale/crawl] needs Go:
https://medium.com/better-programming/install-go-1-11-on-ubuntu-18-04-16-04-lts-8c098c503c5f
https://guide.freecodecamp.org/go/installing-go/ubuntu-apt-get/
To uninstall: https://medium.com/@firebitsbr/how-to-uninstall-from-the-apt-manager-uninstall-just-golang-go-from-universe-debian-ubuntu-82d6a3692cbd
https://tecadmin.net/install-go-on-ubuntu/ [our vagrant VMs are Ubuntu 16.04 LTS, as discovered by running the cmd "lsb_release -a"]
https://alternativeto.net/software/apache-nutch/
https://alternativeto.net/software/wget/
https://github.com/ArchiveTeam/grab-site/blob/master/README.md#inspecting-warc-files-in-the-terminal
https://github.com/ArchiveTeam/wpull

-------------------
Running nutch 2.x
-------------------

LINKS

https://lucene.472066.n3.nabble.com/Nutch-2-x-readdb-command-dump-td4033937.html
https://cwiki.apache.org/confluence/display/nutch/ReaddbOptions


https://lobster1234.github.io/2017/08/14/search-with-nutch-mongodb-solr/ ## most useful for running nutch 2.x crawls

https://www.mobomo.com/2017/06/the-basics-working-with-nutch-2-x/
"Fetch

This is where the magic happens. During the fetch step, Nutch crawls the urls selected in the generate step. The most important argument you need is -threads: this sets the number of fetcher threads per task. Increasing this will make crawling faster, but setting it too high can overwhelm a site and it might shut out your crawler, as well as take up too much memory from your machine. Run it like this:
$ nutch fetch -threads 50"


https://examples.javacodegeeks.com/enterprise-java/apache-hadoop/apache-hadoop-nutch-tutorial/
https://www.yegor256.com/2019/04/17/nutch-from-java.html

http://nutch.sourceforge.net/docs/en/tutorial.html
Intranet: Configuration
To configure things for intranet crawling you must:

Create a flat file of root urls. For example, to crawl the nutch.org site you might start with a file named urls containing just the Nutch home page. All other Nutch pages should be reachable from this page. The urls file would thus look like:

http://www.nutch.org/

Edit the file conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME with the name of the domain you wish to crawl. For example, if you wished to limit the crawl to the nutch.org domain, the line should read:

+^http://([a-z0-9]*\.)*nutch.org/

This will include any url in the domain nutch.org.

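The accept pattern above (minus the leading "+", which just marks it as an accept rule) can be tried out locally with grep's extended-regex mode; a quick sketch (note the tutorial's pattern leaves the dot in nutch.org unescaped, so it would also match e.g. nutchXorg — the dot is escaped here):

```shell
# Test candidate URLs against the intranet urlfilter regex using grep -E.
pattern='^http://([a-z0-9]*\.)*nutch\.org/'

matches() {
    printf '%s\n' "$1" | grep -Eq "$pattern"
}

matches "http://www.nutch.org/docs/" && echo "accepted"    # prints "accepted"
matches "http://example.com/nutch.org/" || echo "rejected" # prints "rejected"
```

This is only a convenience for eyeballing the regex; Nutch applies its own filter chain, which the URLFilterChecker tool further down exercises properly.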
Intranet: Running the Crawl
Once things are configured, running the crawl is easy. Just use the crawl command. Its options include:

-dir dir names the directory to put the crawl in.
-depth depth indicates the link depth from the root page that should be crawled.
-delay delay determines the number of seconds between accesses to each host.
-threads threads determines the number of threads that will fetch in parallel.

For example, a typical call might be:

bin/nutch crawl urls -dir crawl.test -depth 3 >& crawl.log

Typically one starts testing one's configuration by crawling at low depths, and watching the output to check that desired pages are found. Once one is more confident of the configuration, then an appropriate depth for a full crawl is around 10. <===========

Once crawling has completed, one can skip to the Searching section below.

-----------------------------------
Actually running nutch 2.x - steps
-----------------------------------
MANUALLY GOING THROUGH THE CYCLE 3 TIMES:

cd ~/apache-nutch-2.3.1/runtime/local

./bin/nutch inject urls

./bin/nutch generate -topN 50
./bin/nutch fetch -all
./bin/nutch parse -all
./bin/nutch updatedb -all

./bin/nutch generate -topN 50
./bin/nutch fetch -all
./bin/nutch parse -all
./bin/nutch updatedb -all

./bin/nutch generate -topN 50
./bin/nutch fetch -all
./bin/nutch parse -all
./bin/nutch updatedb -all

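The three repeated blocks above are the same generate/fetch/parse/updatedb round each time, so they can be wrapped in a loop. A minimal sketch — NUTCH defaults to echo here so the commands are only printed; set NUTCH=./bin/nutch inside runtime/local to actually run them:

```shell
# Repeat the generate/fetch/parse/updatedb round N times.
# NUTCH defaults to 'echo ./bin/nutch' (dry run: prints the commands only);
# export NUTCH=./bin/nutch to really execute them.
NUTCH="${NUTCH:-echo ./bin/nutch}"

run_rounds() {
    rounds="$1"
    i=1
    while [ "$i" -le "$rounds" ]; do
        $NUTCH generate -topN 50
        $NUTCH fetch -all
        $NUTCH parse -all
        $NUTCH updatedb -all
        i=$((i + 1))
    done
}

run_rounds 3
```

In practice the bundled ./bin/crawl script (used below) does this cycling for you; the loop is just to make the structure of the manual steps explicit.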
Dump output on local filesystem:
rm -rf /tmp/bla
./bin/nutch readdb -dump /tmp/bla [-crawlId ID -text]
less /tmp/bla/part-r-00000

To dump output on HDFS:
Need the hdfs host name if sending/dumping nutch crawl output to a location on hdfs.
The host is defined in /usr/local/hadoop/etc/hadoop/core-site.xml for property fs.defaultFS (https://stackoverflow.com/questions/27956973/java-io-ioexception-incomplete-hdfs-uri-no-host);
the host is hdfs://node2/ in this case.
So:

hdfs dfs -rmdir /user/vagrant/dump
XXX ./bin/nutch readdb -dump user/vagrant/dump -text ### won't work
XXX ./bin/nutch readdb -dump hdfs:///user/vagrant/dump -text ### won't work
./bin/nutch readdb -dump hdfs://node2/user/vagrant/dump -text


USING THE SCRIPT TO ATTEMPT TO CRAWL A SITE
* Choosing to repeat the cycle 10 times because, as per http://nutch.sourceforge.net/docs/en/tutorial.html

"Typically one starts testing one's configuration by crawling at low depths, and watching the output to check that desired pages are found. Once one is more confident of the configuration, then an appropriate depth for a full crawl is around 10."

* Use the ./bin/crawl script; provide the seed urls dir, the crawlId and the number of times to repeat = 10
vagrant@node2:~/apache-nutch-2.3.1/runtime/local$ ./bin/crawl urls davidbHomePage 10


* View the downloaded crawls.
This time need to provide the crawlId to readdb, in order to get a dump of its text contents:
hdfs dfs -rm -r hdfs://node2/user/vagrant/dump2
./bin/nutch readdb -dump hdfs://node2/user/vagrant/dump2 -text -crawlId davidbHomePage

* View the contents:
hdfs dfs -cat hdfs://node2/user/vagrant/dump2/part-r-*


* FIND OUT NUMBER OF URLS DOWNLOADED FOR THE SITE:
vagrant@node2:~/apache-nutch-2.3.1/runtime/local$ ./bin/nutch readdb -stats -crawlId davidbHomePage
WebTable statistics start
Statistics for WebTable:
retry 0:    44
status 5 (status_redir_perm):    4
status 3 (status_gone):    1
status 2 (status_fetched):    39
jobs: {[davidbHomePage]db_stats-job_local647846559_0001={jobName=[davidbHomePage]db_stats, jobID=job_local647846559_0001, counters={Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=135, REDUCE_INPUT_RECORDS=8, SPILLED_RECORDS=16, MERGED_MAP_OUTPUTS=1, VIRTUAL_MEMORY_BYTES=0, MAP_INPUT_RECORDS=44, SPLIT_RAW_BYTES=935, FAILED_SHUFFLE=0, MAP_OUTPUT_BYTES=2332, REDUCE_SHUFFLE_BYTES=135, PHYSICAL_MEMORY_BYTES=0, GC_TIME_MILLIS=0, REDUCE_INPUT_GROUPS=8, COMBINE_OUTPUT_RECORDS=8, SHUFFLED_MAPS=1, REDUCE_OUTPUT_RECORDS=8, MAP_OUTPUT_RECORDS=176, COMBINE_INPUT_RECORDS=176, CPU_MILLISECONDS=0, COMMITTED_HEAP_BYTES=595591168}, File Input Format Counters ={BYTES_READ=0}, File System Counters={FILE_LARGE_READ_OPS=0, FILE_WRITE_OPS=0, FILE_READ_OPS=0, FILE_BYTES_WRITTEN=1788140, FILE_BYTES_READ=1223290}, File Output Format Counters ={BYTES_WRITTEN=275}, Shuffle Errors={CONNECTION=0, WRONG_LENGTH=0, BAD_ID=0, WRONG_MAP=0, WRONG_REDUCE=0, IO_ERROR=0}}}}
TOTAL urls:    44
max score:    1.0
avg score:    0.022727273
min score:    0.0
WebTable statistics: done

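If a readdb -stats dump like the one above has been saved to a file, the headline numbers can be pulled out with awk. A sketch over the output format shown above (the whitespace after the colon may vary between Nutch versions):

```shell
# Extract the 'TOTAL urls' count from saved 'readdb -stats' output.
# The sample here is a shortened copy of the stats block above.
stats='retry 0:    44
status 2 (status_fetched):    39
TOTAL urls:    44
max score:    1.0'

total_urls=$(printf '%s\n' "$stats" | awk -F':[ ]*' '/^TOTAL urls/ {print $2}')
echo "$total_urls"   # prints 44
```

Useful when comparing URL counts across several crawlIds without re-reading the whole counters blob.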
------------------------------------
STOPPING CONDITION
Seems built in.
* When I tell it to cycle 15 times, it stops after 6 cycles saying there are no more URLs to fetch:

vagrant@node2:~/apache-nutch-2.3.1/runtime/local$ ./bin/crawl urls davidbHomePage2 15
---
No SOLRURL specified. Skipping indexing.
Injecting seed URLs

...

Thu Oct 3 09:22:23 UTC 2019 : Iteration 6 of 15
Generating batchId
Generating a new fetchlist
...
Generating batchId
Generating a new fetchlist
/home/vagrant/apache-nutch-2.3.1/runtime/local/bin/nutch generate -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -topN 50000 -noNorm -noFilter -adddays 0 -crawlId davidbHomePage2 -batchId 1570094569-27637
GeneratorJob: starting at 2019-10-03 09:22:49
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: false
GeneratorJob: normalizing: false
GeneratorJob: topN: 50000
GeneratorJob: finished at 2019-10-03 09:22:52, time elapsed: 00:00:02
GeneratorJob: generated batch id: 1570094569-27637 containing 0 URLs
Generate returned 1 (no new segments created)
Escaping loop: no more URLs to fetch now
vagrant@node2:~/apache-nutch-2.3.1/runtime/local$
---

* Running readdb -stats shows 44 URLs fetched, just as the first time (when the crawlId had been "davidbHomePage"):

vagrant@node2:~/apache-nutch-2.3.1/runtime/local$ ./bin/nutch readdb -stats -crawlId davidbHomePage2
---
WebTable statistics start
Statistics for WebTable:
retry 0:    44
status 5 (status_redir_perm):    4
status 3 (status_gone):    1
status 2 (status_fetched):    39
jobs: {[davidbHomePage2]db_stats-job_local985519583_0001={jobName=[davidbHomePage2]db_stats, jobID=job_local985519583_0001, counters={Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=135, REDUCE_INPUT_RECORDS=8, SPILLED_RECORDS=16, MERGED_MAP_OUTPUTS=1, VIRTUAL_MEMORY_BYTES=0, MAP_INPUT_RECORDS=44, SPLIT_RAW_BYTES=935, FAILED_SHUFFLE=0, MAP_OUTPUT_BYTES=2332, REDUCE_SHUFFLE_BYTES=135, PHYSICAL_MEMORY_BYTES=0, GC_TIME_MILLIS=4, REDUCE_INPUT_GROUPS=8, COMBINE_OUTPUT_RECORDS=8, SHUFFLED_MAPS=1, REDUCE_OUTPUT_RECORDS=8, MAP_OUTPUT_RECORDS=176, COMBINE_INPUT_RECORDS=176, CPU_MILLISECONDS=0, COMMITTED_HEAP_BYTES=552599552}, File Input Format Counters ={BYTES_READ=0}, File System Counters={FILE_LARGE_READ_OPS=0, FILE_WRITE_OPS=0, FILE_READ_OPS=0, FILE_BYTES_WRITTEN=1788152, FILE_BYTES_READ=1223290}, File Output Format Counters ={BYTES_WRITTEN=275}, Shuffle Errors={CONNECTION=0, WRONG_LENGTH=0, BAD_ID=0, WRONG_MAP=0, WRONG_REDUCE=0, IO_ERROR=0}}}}
TOTAL urls:    44
---

----------------------------------------------------------------------
Testing URLFilters: testing a URL to see if it's accepted
----------------------------------------------------------------------
Use the command
./bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
(mentioned at https://lucene.472066.n3.nabble.com/Correct-syntax-for-regex-urlfilter-txt-trying-to-exclude-single-path-results-td3600376.html)

Use as follows:

cd apache-nutch-2.3.1/runtime/local

./bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined

Then paste the URL you want to test and press Enter.
A + in front of the response means accepted;
a - in front of the response means rejected.
You can continue pasting URLs to test against the filters until you send Ctrl-D to terminate input.



-------------------
Dr Nichols's suggestion: can store a listing of potential product sites to inspect, by checking the url for /mi in combination with whether the domain's IP geolocates to OUTSIDE New Zealand (tld .nz).
* https://stackoverflow.com/questions/1415851/best-way-to-get-geo-location-in-java
  - https://mvnrepository.com/artifact/com.maxmind.geoip/geoip-api/1.2.10
  - the older .dat.gz file is archived at https://web.archive.org/web/20180917084618/http://geolite.maxmind.com/download/geoip/database/GeoLiteCity.dat.gz
  - and newer geo country data at https://dev.maxmind.com/geoip/geoip2/geolite2/
* https://dev.maxmind.com/geoip/geoip2/geolite2/
* older GeoIp API (has LookupService): https://github.com/maxmind/geoip-api-java
* newer GeoIp2 API: https://dev.maxmind.com/geoip/geoip2/downloadable/#MaxMind_APIs
  and https://maxmind.github.io/GeoIP2-java/doc/v2.12.0/
* https://maxmind.github.io/GeoIP2-java/
* https://github.com/AtlasOfLivingAustralia/ala-hub/issues/11
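The URL-side half of that check (the /mi path test plus the .nz-TLD exclusion) can be sketched in shell; the IP-geolocation half would still need one of the MaxMind databases linked above, so it is left out here. The helper name and sample URLs are made up for illustration:

```shell
# Hypothetical helper: flag a URL as a candidate product site when its
# path contains an /mi segment AND its host is NOT under the .nz TLD.
# (The real check would additionally geolocate the host's IP via MaxMind.)
is_candidate() {
    url="$1"
    # crude host extraction: strip the scheme, keep up to the first '/'
    host=$(printf '%s\n' "$url" | sed -E 's#^[a-z]+://([^/]+).*#\1#')
    case "$host" in
        *.nz) return 1 ;;         # .nz TLD: inside New Zealand, skip
    esac
    case "$url" in
        */mi/*|*/mi) return 0 ;;  # has an /mi path segment
    esac
    return 1
}

is_candidate "https://example.com/mi/products" && echo "inspect"   # prints "inspect"
is_candidate "https://example.co.nz/mi/products" || echo "skip"    # prints "skip"
```

This could run as a cheap pre-filter over a readdb dump before doing the slower per-host geolocation lookups.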


---
https://check-host.net/ip-info
https://ipinfo.info/html/ip_checker.php


----------
MongoDB
Installation:
https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/
https://docs.mongodb.com/manual/administration/install-on-linux/
https://hevodata.com/blog/install-mongodb-on-ubuntu/
https://www.digitalocean.com/community/tutorials/how-to-install-mongodb-on-ubuntu-16-04
CENTOS (Analytics): https://tecadmin.net/install-mongodb-on-centos/
FROM SOURCE: https://github.com/mongodb/mongo/wiki/Build-Mongodb-From-Source
GUI:
https://robomongo.org/
Robomongo is Robo 3T now

https://www.tutorialspoint.com/mongodb/mongodb_java.htm
JAR FILE:
http://central.maven.org/maven2/org/mongodb/mongo-java-driver/
https://mongodb.github.io/mongo-java-driver/


INSTALLING THE MONGODB SERVER AND MONGO CLIENT ON LINUX
Need to have sudo and root powers.

https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/
http://www.programmersought.com/article/6500308940/

sudo apt-get install mongodb-clients
mongo 'mongodb://mongodb.cms.waikato.ac.nz:27017' -u anupama -p

Failed with
Error: HostAndPort: host is empty at src/mongo/shell/mongo.js:148
exception: connect failed

This is due to a version incompatibility between the client and the mongodb server.
The solution is to follow the instructions at http://www.programmersought.com/article/6500308940/
and then https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/
as below:

sudo apt-get purge mongodb-clients
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 9DA31620334BD75D9DCB49F368818C72E52529D4
echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu xenial/mongodb-org/4.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.0.list
sudo apt-get update
sudo apt-get install mongodb-clients
mongo 'mongodb://mongodb.cms.waikato.ac.nz:27017' -u anupama -p
(still doesn't work)
sudo apt-get install -y mongodb-org
The above ensures an up-to-date mongo client but installs the mongodb server too. Maybe this is the only step needed to install an up-to-date mongo client and mongodb server?
sudo service mongod status

sudo service mongod start
"mongod" stands for mongo daemon. This runs the mongo db server, listening for client connections.
sudo service mongod status
sudo service mongod stop


RUNNING AND USING THE MONGO CLIENT SHELL:
Among the many things you can do with the Mongo client shell, one can use it to find the mongo client version (which is the version of the shell) and the mongo db version.

To run the mongo client shell WITHOUT loading a db:


wharariki:[880]/Scratch/ak19/gs3-extensions/maori-lang-detection>mongo --shell -nodb
MongoDB shell version: 2.6.10    <<<<<<<<<-------------------<<<< MONGO CLIENT VERSION

type "help" for help
> help
	db.help()                    help on db methods
	db.mycoll.help()             help on collection methods
	sh.help()                    sharding helpers
	rs.help()                    replica set helpers
	help admin                   administrative help
	help connect                 connecting to a db help
	help keys                    key shortcuts
	help misc                    misc things to know
	help mr                      mapreduce

	show dbs                     show database names
	show collections             show collections in current database
	show users                   show users in current database
	show profile                 show most recent system.profile entries with time >= 1ms
	show logs                    show the accessible logger names
	show log [name]              prints out the last segment of log in memory, 'global' is default
	use <db_name>                set current database
	db.foo.find()                list objects in collection foo
	db.foo.find( { a : 1 } )     list objects in foo where a == 1
	it                           result of the last line evaluated; use to further iterate
	DBQuery.shellBatchSize = x   set default number of items to display on shell
	exit                         quit the mongo shell

> help connect

Normally one specifies the server on the mongo shell command line. Run mongo --help to see those options.
Additional connections may be opened:

	var x = new Mongo('host[:port]');
	var mydb = x.getDB('mydb');
or
	var mydb = connect('host[:port]/mydb');

Note: the REPL prompt only auto-reports getLastError() for the shell command line connection.

Getting help on connect options:

> var x = new Mongo('mongodb.cms.waikato.ac.nz:27017');
> var mydb = x.getDB('anupama');

> mydb.connect.help()
DBCollection help
	db.connect.find().help() - show DBCursor help
	db.connect.count()
	db.connect.copyTo(newColl) - duplicates collection by copying all documents to newColl; no indexes are copied.
	db.connect.convertToCapped(maxBytes) - calls {convertToCapped:'connect', size:maxBytes}} command
	db.connect.dataSize()
	db.connect.distinct( key ) - e.g. db.connect.distinct( 'x' )
	db.connect.drop() drop the collection
	db.connect.dropIndex(index) - e.g. db.connect.dropIndex( "indexName" ) or db.connect.dropIndex( { "indexKey" : 1 } )
	db.connect.dropIndexes()
	db.connect.ensureIndex(keypattern[,options]) - options is an object with these possible fields: name, unique, dropDups
	db.connect.reIndex()
	db.connect.find([query],[fields]) - query is an optional query filter. fields is optional set of fields to return.
	                                    e.g. db.connect.find( {x:77} , {name:1, x:1} )
	db.connect.find(...).count()
	db.connect.find(...).limit(n)
	db.connect.find(...).skip(n)
	db.connect.find(...).sort(...)
	db.connect.findOne([query])
	db.connect.findAndModify( { update : ... , remove : bool [, query: {}, sort: {}, 'new': false] } )
	db.connect.getDB() get DB object associated with collection
	db.connect.getPlanCache() get query plan cache associated with collection
	db.connect.getIndexes()
	db.connect.group( { key : ..., initial: ..., reduce : ...[, cond: ...] } )
	db.connect.insert(obj)
	db.connect.mapReduce( mapFunction , reduceFunction , <optional params> )
	db.connect.aggregate( [pipeline], <optional params> ) - performs an aggregation on a collection; returns a cursor
	db.connect.remove(query)
	db.connect.renameCollection( newName , <dropTarget> ) renames the collection.
	db.connect.runCommand( name , <options> ) runs a db command with the given name where the first param is the collection name
	db.connect.save(obj)
	db.connect.stats()
	db.connect.storageSize() - includes free space allocated to this collection
	db.connect.totalIndexSize() - size in bytes of all the indexes
	db.connect.totalSize() - storage allocated for all data and indexes
	db.connect.update(query, object[, upsert_bool, multi_bool]) - instead of two flags, you can pass an object with fields: upsert, multi
	db.connect.validate( <full> ) - SLOW
	db.connect.getShardVersion() - only for use with sharding
	db.connect.getShardDistribution() - prints statistics about data distribution in the cluster
	db.connect.getSplitKeysForChunks( <maxChunkSize> ) - calculates split points over all chunks and returns splitter function
	db.connect.getWriteConcern() - returns the write concern used for any operations on this collection, inherited from server/db if set
	db.connect.setWriteConcern( <write concern doc> ) - sets the write concern for writes to the collection
	db.connect.unsetWriteConcern( <write concern doc> ) - unsets the write concern for writes to the collection
> mydb.version()
4.0.13    <<<<<<<<<-------------------<<<< MONGODB SERVER VERSION

(Check Mongo server version: https://stackoverflow.com/questions/38160412/how-to-find-the-exact-version-of-installed-mongodb)

Finally we now know the mongodb server version: 4.0.13.
This version didn't work with our mongo client (shell) version of 2.6.10, and that's why we had to upgrade the client.


INSTALLATION MONGO-DB AND CLIENT
FROM: https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/
wget -qO - https://www.mongodb.org/static/pgp/server-4.2.asc | sudo apt-key add -
echo "deb [ arch=amd64 ] https://repo.mongodb.org/apt/ubuntu xenial/mongodb-org/4.2 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.2.list
sudo apt-get update
sudo apt-get install -y mongodb-org

UNINSTALLING
https://www.anintegratedworld.com/uninstall-mongodb-in-ubuntu-via-command-line-in-3-easy-steps/


504 |
|
---|
505 | MONGO DB ROBO 3T
|
---|
506 | 1. Download "Double Pack" from https://robomongo.org/
|
---|
507 | 2. Untar its contents. Then untar the tarball in that.
|
---|
508 | 3. Run:
|
---|
509 | wharariki:[110]~/Downloads/robo3t-1.3.1-linux-x86_64-7419c406>./bin/robo3t
|
---|
510 |
|
---|