Changeset 33428

Timestamp: 19.08.2019 20:31:23 (4 weeks ago)
Author: ak19
Message: Working commoncrawl cc-warc-examples' WET wordcount example using Hadoop. And some more links.

Files: 1 modified

  • gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt

(diff from r33425 to r33428)
To run firefox/anything graphical inside the VM run by vagrant, you have to ssh -Y twice: first onto analytics, then from analytics into the vagrant VM:
1. ssh analytics -Y
2. [anupama@analytics vagrant-hadoop-hive-spark]$ vagrant ssh -- -Y
or
vagrant ssh -- -Y node1
(the -- flag tells the vagrant command that the subsequent -Y flag should be passed to the ssh cmd that vagrant runs)

Only once ssh-ed with vagrant into the VM whose hostname is "node1" do you have access to node1's assigned IP: 10.211.55.101
- Machines connecting from elsewhere, like analytics, must either reach node1 by that IP or use port forwarding to view the VM's servers on localhost. For example, on analytics you can then view the Yarn pages at http://localhost:8088/
- If firefox is launched inside the VM (so inside node1), then pages can be accessed off their respective ports at any of localhost|10.211.55.101|node1.
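The port forwarding mentioned above could also be kept in an ssh config entry so it doesn't have to be retyped. A hypothetical sketch only (the host alias is made up; the IP, user and the 8088 Yarn UI port are taken from the notes above):

```
# Hypothetical ~/.ssh/config entry on analytics: after "ssh node1-yarn",
# node1's Yarn web UI would be reachable at http://localhost:8088/
Host node1-yarn
    HostName 10.211.55.101
    User vagrant
    LocalForward 8088 localhost:8088
```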

WET example from https://github.com/commoncrawl/cc-warc-examples

vagrant@node1:~/cc-warc-examples$ hdfs dfs -mkdir /user/vagrant/data
vagrant@node1:~/cc-warc-examples$ hdfs dfs -put data/CC-MAIN-20190715175205-20190715200159-00000.warc.wet.gz hdfs:///user/vagrant/data/.
vagrant@node1:~/cc-warc-examples$ hdfs dfs -ls data
Found 1 items
-rw-r--r--   1 vagrant supergroup  154823265 2019-08-19 08:23 data/CC-MAIN-20190715175205-20190715200159-00000.warc.wet.gz
vagrant@node1:~/cc-warc-examples$ hadoop jar target/cc-warc-examples-0.3-SNAPSHOT-jar-with-dependencies.jar org.commoncrawl.examples.mapreduce.WETWordCount

<ONCE FINISHED:>

vagrant@node1:~/cc-warc-examples$ hdfs dfs -cat /tmp/cc/part*
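The part* files hold the per-word counts the job emitted. As a rough local illustration of the same computation (this is not the Hadoop job; the sample text here is made up), standard tools produce equivalent counts:

```shell
# Toy word count over a plain-text sample, mirroring what a word count
# job computes over the WET records (illustration only).
printf 'the cat sat on the mat\n' \
  | tr -s '[:space:]' '\n' \
  | sort \
  | uniq -c \
  | sort -rn
# "the" comes out on top with a count of 2, the rest with 1 each.
```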


INFO ON HADOOP/HDFS:
https://www.bluedata.com/blog/2016/08/hadoop-hdfs-upgrades-painful/

---------------
More examples to try:
https://github.com/commoncrawl/cc-warc-examples


A bit outdated?
https://www.journaldev.com/20342/apache-spark-example-word-count-program-java
https://www.journaldev.com/20261/apache-spark

--------

sudo apt-get install maven
(or sudo apt update