Changeset 33545 for gs3-extensions/maori-lang-detection
- Timestamp: 2019-10-03T22:38:00+13:00
- Location: gs3-extensions/maori-lang-detection
- Files: 3 edited
gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt
(r33499 → r33545)

https://www.guru99.com/create-your-first-hadoop-program.html

Some Hadoop commands
* https://community.cloudera.com/t5/Support-Questions/Closed-How-to-store-output-of-shell-script-in-HDFS/td-p/229933
* https://stackoverflow.com/questions/26513861/checking-if-directory-in-hdfs-already-exists-or-not

--------------
To run firefox/anything graphical inside the VM run by vagrant, you have to ssh -Y first onto analytics and then from analytics onto the vagrant VM:
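For quick reference, a minimal sketch of the two operations those links cover (the paths and myscript.sh are just placeholders):

    # check whether an HDFS directory already exists: -test -d returns 0 if it does
    hdfs dfs -test -d /user/vagrant/dump && echo "dump dir exists" || echo "no dump dir"

    # store the output of a shell script directly in HDFS ("-" makes -put read from stdin)
    ./myscript.sh | hdfs dfs -put - /user/vagrant/myscript-output.txt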
gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt
(r33541 → r33545)

/usr/local/hbase/bin/hbase shell
https://learnhbase.net/2013/03/02/hbase-shell-commands/
http://dwgeek.com/read-hbase-table-using-hbase-shell-get-command.html/

> list

davidbHomePage_webpage is a table

> get 'davidbHomePage_webpage', '1'

Solution to get a working nutch2:
...
https://github.com/ArchiveTeam/wpull

-------------------
Running nutch 2.x
-------------------

LINKS

https://lucene.472066.n3.nabble.com/Nutch-2-x-readdb-command-dump-td4033937.html
https://cwiki.apache.org/confluence/display/nutch/ReaddbOptions

https://lobster1234.github.io/2017/08/14/search-with-nutch-mongodb-solr/   ## most useful for running nutch 2.x crawls

https://www.mobomo.com/2017/06/the-basics-working-with-nutch-2-x/
"Fetch

This is where the magic happens. During the fetch step, Nutch crawls the urls selected in the generate step. The most important argument you need is -threads: this sets the number of fetcher threads per task. Increasing this will make crawling faster, but setting it too high can overwhelm a site and it might shut out your crawler, as well as take up too much memory from your machine. Run it like this:
$ nutch fetch -threads 50"

https://examples.javacodegeeks.com/enterprise-java/apache-hadoop/apache-hadoop-nutch-tutorial/
https://www.yegor256.com/2019/04/17/nutch-from-java.html

http://nutch.sourceforge.net/docs/en/tutorial.html

Intranet: Configuration
To configure things for intranet crawling you must:

Create a flat file of root urls. For example, to crawl the nutch.org site you might start with a file named urls containing just the Nutch home page. All other Nutch pages should be reachable from this page. The urls file would thus look like:

http://www.nutch.org/

Edit the file conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME with the name of the domain you wish to crawl. For example, if you wished to limit the crawl to the nutch.org domain, the line should read:

+^http://([a-z0-9]*\.)*nutch.org/

This will include any url in the domain nutch.org.

Intranet: Running the Crawl
Once things are configured, running the crawl is easy. Just use the crawl command. Its options include:

  -dir dir          names the directory to put the crawl in.
  -depth depth      indicates the link depth from the root page that should be crawled.
  -delay delay      determines the number of seconds between accesses to each host.
  -threads threads  determines the number of threads that will fetch in parallel.

For example, a typical call might be:

  bin/nutch crawl urls -dir crawl.test -depth 3 >& crawl.log

Typically one starts testing one's configuration by crawling at low depths, and watching the output to check that desired pages are found. Once one is more confident of the configuration, then an appropriate depth for a full crawl is around 10. <===========

Once crawling has completed, one can skip to the Searching section below.
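A few more hbase shell commands that can be handy for poking at the table above (standard HBase shell syntax; the table name is the one mentioned above - treat this as a sketch, not part of the original notes):

> describe 'davidbHomePage_webpage'             # show the table's column families
> scan 'davidbHomePage_webpage', {LIMIT => 2}   # peek at the first couple of rows
> count 'davidbHomePage_webpage'                # rough row count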
-----------------------------------
Actually running nutch 2.x - steps
-----------------------------------
MANUALLY GOING THROUGH THE CYCLE 3 TIMES:

cd ~/apache-nutch-2.3.1/runtime/local

./bin/nutch inject urls

./bin/nutch generate -topN 50
./bin/nutch fetch -all
./bin/nutch parse -all
./bin/nutch updatedb -all

./bin/nutch generate -topN 50
./bin/nutch fetch -all
./bin/nutch parse -all
./bin/nutch updatedb -all

./bin/nutch generate -topN 50
./bin/nutch fetch -all
./bin/nutch parse -all
./bin/nutch updatedb -all

Dump output on the local filesystem:
  rm -rf /tmp/bla
  ./bin/nutch readdb -dump /tmp/bla
  less /tmp/bla/part-r-00000

To dump output onto hdfs:
Need the hdfs host name if sending/dumping nutch crawl output to a location on hdfs.
The host is defined in /usr/local/hadoop/etc/hadoop/core-site.xml under the property fs.defaultFS
(https://stackoverflow.com/questions/27956973/java-io-ioexception-incomplete-hdfs-uri-no-host);
the host is hdfs://node2/ in this case. So:

  hdfs dfs -rmdir /user/vagrant/dump
  XXX ./bin/nutch readdb -dump user/vagrant/dump -text              ### won't work
  XXX ./bin/nutch readdb -dump hdfs:///user/vagrant/dump -text      ### won't work
  ./bin/nutch readdb -dump hdfs://node2/user/vagrant/dump -text


USING THE SCRIPT TO ATTEMPT TO CRAWL A SITE
* Choosing to repeat the cycle 10 times because, as per http://nutch.sourceforge.net/docs/en/tutorial.html:

  "Typically one starts testing one's configuration by crawling at low depths, and watching the output to check that desired pages are found. Once one is more confident of the configuration, then an appropriate depth for a full crawl is around 10."

* Use the ./bin/crawl script; provide the seed urls dir, the crawlId and the number of times to repeat (10). A looped version of the manual cycle above is sketched just below.
  vagrant@node2:~/apache-nutch-2.3.1/runtime/local$ ./bin/crawl urls davidbHomePage 10
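For reference, a minimal shell sketch of what that looping amounts to: inject once, then repeat generate/fetch/parse/updatedb until either the round limit is hit or generate reports nothing new. NUM_ROUNDS and the -topN value are just example settings, and ./bin/crawl itself does more than this:

    #!/bin/bash
    cd ~/apache-nutch-2.3.1/runtime/local
    NUM_ROUNDS=10

    ./bin/nutch inject urls

    for ((i=1; i<=NUM_ROUNDS; i++)); do
        echo "=== crawl round $i of $NUM_ROUNDS ==="
        # generate returns non-zero when no URLs are due for fetching (see STOPPING CONDITION below)
        ./bin/nutch generate -topN 50 || { echo "no new URLs to fetch - stopping"; break; }
        ./bin/nutch fetch -all
        ./bin/nutch parse -all
        ./bin/nutch updatedb -all
    done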
* View the downloaded crawls.
  This time we need to provide the crawlId to readdb in order to get a dump of its text contents:
  hdfs dfs -rm -r hdfs://node2/user/vagrant/dump2
  ./bin/nutch readdb -dump hdfs://node2/user/vagrant/dump2 -text -crawlId davidbHomePage

* View the contents:
  hdfs dfs -cat hdfs://node2/user/vagrant/dump2/part-r-*
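To grep through the dump locally rather than cat-ing it out of HDFS, something like the following should work (the grep patterns are guesses at the dump's record layout - eyeball a part file first and adjust):

  hdfs dfs -get hdfs://node2/user/vagrant/dump2 /tmp/dump2
  grep -c "^http" /tmp/dump2/part-r-*                     # rough count of URL records
  grep "status:" /tmp/dump2/part-r-* | sort | uniq -c     # breakdown by fetch status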
* FIND OUT NUMBER OF URLS DOWNLOADED FOR THE SITE:
vagrant@node2:~/apache-nutch-2.3.1/runtime/local$ ./bin/nutch readdb -stats -crawlId davidbHomePage
WebTable statistics start
Statistics for WebTable:
retry 0:        44
status 5 (status_redir_perm):   4
status 3 (status_gone): 1
status 2 (status_fetched):      39
jobs:   {[davidbHomePage]db_stats-job_local647846559_0001={jobName=[davidbHomePage]db_stats, jobID=job_local647846559_0001, counters={Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=135, REDUCE_INPUT_RECORDS=8, SPILLED_RECORDS=16, MERGED_MAP_OUTPUTS=1, VIRTUAL_MEMORY_BYTES=0, MAP_INPUT_RECORDS=44, SPLIT_RAW_BYTES=935, FAILED_SHUFFLE=0, MAP_OUTPUT_BYTES=2332, REDUCE_SHUFFLE_BYTES=135, PHYSICAL_MEMORY_BYTES=0, GC_TIME_MILLIS=0, REDUCE_INPUT_GROUPS=8, COMBINE_OUTPUT_RECORDS=8, SHUFFLED_MAPS=1, REDUCE_OUTPUT_RECORDS=8, MAP_OUTPUT_RECORDS=176, COMBINE_INPUT_RECORDS=176, CPU_MILLISECONDS=0, COMMITTED_HEAP_BYTES=595591168}, File Input Format Counters ={BYTES_READ=0}, File System Counters={FILE_LARGE_READ_OPS=0, FILE_WRITE_OPS=0, FILE_READ_OPS=0, FILE_BYTES_WRITTEN=1788140, FILE_BYTES_READ=1223290}, File Output Format Counters ={BYTES_WRITTEN=275}, Shuffle Errors={CONNECTION=0, WRONG_LENGTH=0, BAD_ID=0, WRONG_MAP=0, WRONG_REDUCE=0, IO_ERROR=0}}}}
TOTAL urls:     44
max score:      1.0
avg score:      0.022727273
min score:      0.0
WebTable statistics: done

------------------------------------
STOPPING CONDITION
Seems inbuilt.
* When I tell it to cycle 15 times, it stops after 6 cycles, saying there are no more URLs to fetch:

vagrant@node2:~/apache-nutch-2.3.1/runtime/local$ ./bin/crawl urls davidbHomePage2 15
---
No SOLRURL specified. Skipping indexing.
Injecting seed URLs

...

Thu Oct 3 09:22:23 UTC 2019 : Iteration 6 of 15
Generating batchId
Generating a new fetchlist
...
Generating batchId
Generating a new fetchlist
/home/vagrant/apache-nutch-2.3.1/runtime/local/bin/nutch generate -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -topN 50000 -noNorm -noFilter -adddays 0 -crawlId davidbHomePage2 -batchId 1570094569-27637
GeneratorJob: starting at 2019-10-03 09:22:49
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: false
GeneratorJob: normalizing: false
GeneratorJob: topN: 50000
GeneratorJob: finished at 2019-10-03 09:22:52, time elapsed: 00:00:02
GeneratorJob: generated batch id: 1570094569-27637 containing 0 URLs
Generate returned 1 (no new segments created)
Escaping loop: no more URLs to fetch now
vagrant@node2:~/apache-nutch-2.3.1/runtime/local$
---

* Running readdb -stats shows 44 URLs fetched, just as the first time (when the crawlId had been "davidbHomePage"):

vagrant@node2:~/apache-nutch-2.3.1/runtime/local$ ./bin/nutch readdb -stats -crawlId davidbHomePage2
---
WebTable statistics start
Statistics for WebTable:
retry 0:        44
status 5 (status_redir_perm):   4
status 3 (status_gone): 1
status 2 (status_fetched):      39
jobs:   {[davidbHomePage2]db_stats-job_local985519583_0001={jobName=[davidbHomePage2]db_stats, jobID=job_local985519583_0001, counters={Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=135, REDUCE_INPUT_RECORDS=8, SPILLED_RECORDS=16, MERGED_MAP_OUTPUTS=1, VIRTUAL_MEMORY_BYTES=0, MAP_INPUT_RECORDS=44, SPLIT_RAW_BYTES=935, FAILED_SHUFFLE=0, MAP_OUTPUT_BYTES=2332, REDUCE_SHUFFLE_BYTES=135, PHYSICAL_MEMORY_BYTES=0, GC_TIME_MILLIS=4, REDUCE_INPUT_GROUPS=8, COMBINE_OUTPUT_RECORDS=8, SHUFFLED_MAPS=1, REDUCE_OUTPUT_RECORDS=8, MAP_OUTPUT_RECORDS=176, COMBINE_INPUT_RECORDS=176, CPU_MILLISECONDS=0, COMMITTED_HEAP_BYTES=552599552}, File Input Format Counters ={BYTES_READ=0}, File System Counters={FILE_LARGE_READ_OPS=0, FILE_WRITE_OPS=0, FILE_READ_OPS=0, FILE_BYTES_WRITTEN=1788152, FILE_BYTES_READ=1223290}, File Output Format Counters ={BYTES_WRITTEN=275}, Shuffle Errors={CONNECTION=0, WRONG_LENGTH=0, BAD_ID=0, WRONG_MAP=0, WRONG_REDUCE=0, IO_ERROR=0}}}}
TOTAL urls:     44
---
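The full -stats output is noisy; to pull out just the counts when comparing crawlIds, a pipe like this (just a sketch) keeps the interesting lines:

  ./bin/nutch readdb -stats -crawlId davidbHomePage2 2>/dev/null | grep -E "TOTAL urls|status|retry"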
gs3-extensions/maori-lang-detection/hdfs-cc-work/GS_README.TXT
(r33543 → r33545)

4. Since trying to go install the crawl url didn't work
   https://stackoverflow.com/questions/14416275/error-cant-load-package-package-my-prog-found-packages-my-prog-and-main
   [https://stackoverflow.com/questions/26694271/go-install-doesnt-create-any-bin-file]

   vagrant@node2:~/go/src$
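A note on the second link above: go install only drops a binary into $GOPATH/bin when the package being built is "package main" with a func main(); a library-only package just gets compiled, which can look like nothing happened. A minimal check, with "mycrawler" as a placeholder package directory:

   cd ~/go/src/mycrawler     # placeholder path - must contain "package main" with func main()
   go install
   ls ~/go/bin               # the mycrawler binary should appear here if the install worked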